Tutorial 96 - Deep Learning terminology explained - Back propagation and optimizers

The essence of deep learning is to find the best weights (and biases) for the network, i.e. those that minimize the error (loss). This is done via an iterative process in which the weights are updated at each iteration in the direction that reduces the loss. To find this direction we need the slope of the loss function with respect to a given weight, which we get by computing the derivative (gradient). Computing derivatives for millions of weights is computationally expensive; backpropagation makes it feasible by using the chain rule from calculus to find the derivative of the loss with respect to every weight in the network. Gradient descent is the general term for calculating the gradient and updating the weights. A minimal numerical sketch of this update rule is shown after these paragraphs.

In Gradient Descent (GD) optimization, the weights are updated once per epoch (i.e. after the network has seen the entire training dataset). For large datasets this becomes computationally expensive. In Stochastic Gradient Descent (SGD), the weights are updated after each training sample (or mini-batch). "Stochastic" refers to the fact that the gradient is based on a small sample and is therefore a stochastic approximation of the true gradient. Due to this stochastic nature the path towards the global minimum can be bumpy, but it still heads towards a minimum.

Adam optimizer: computationally efficient, with little memory requirement, and well suited for problems that are large in terms of data and/or parameters. The snippet at the end of this section shows how these optimizer choices look in practice.
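To make this concrete, here is a minimal sketch (not from the tutorial) of gradient descent with backpropagation for a single-weight model; the toy data, learning rate, and epoch count are invented for illustration. The point is only how the chain rule yields dLoss/dw and how the weight is stepped against that gradient.

import numpy as np

# Toy data for the made-up target y = 2x, fitted with a single weight and no bias
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

w = 0.0              # initial weight
learning_rate = 0.05

for epoch in range(20):
    y_pred = w * X                        # forward pass
    loss = np.mean((y_pred - y) ** 2)     # mean squared error

    # Backpropagation = chain rule: dLoss/dw = dLoss/dy_pred * dy_pred/dw
    grad = np.mean(2 * (y_pred - y) * X)

    # Gradient descent update: step the weight against the gradient
    w -= learning_rate * grad

print(round(w, 3))   # approaches 2.0, the weight that minimizes the loss

A framework performs this same loop for millions of weights at once; backpropagation simply organizes the chain-rule bookkeeping so that every weight's gradient is computed efficiently in one backward pass.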

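The sketch below shows how the three flavours of optimization discussed above could map onto a Keras/TensorFlow workflow (an assumption, since no framework is named here): full-batch gradient descent is just SGD with batch_size equal to the whole dataset, mini-batch SGD uses a small batch_size, and Adam is selected by swapping the optimizer. The dataset, layer sizes, learning rates, and epoch counts are placeholders for illustration.

import numpy as np
import tensorflow as tf

# Hypothetical toy dataset: 1000 samples, 20 features, binary labels
X = np.random.rand(1000, 20).astype(np.float32)
y = (X.sum(axis=1) > 10).astype(np.float32)

def build_model():
    return tf.keras.Sequential([
        tf.keras.Input(shape=(20,)),
        tf.keras.layers.Dense(16, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])

# "Batch" gradient descent: one weight update per epoch (batch = entire dataset)
gd_model = build_model()
gd_model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
                 loss='binary_crossentropy')
gd_model.fit(X, y, batch_size=len(X), epochs=10, verbose=0)

# Mini-batch SGD: many noisy (stochastic) updates per epoch
sgd_model = build_model()
sgd_model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
                  loss='binary_crossentropy')
sgd_model.fit(X, y, batch_size=32, epochs=10, verbose=0)

# Adam: adaptive per-parameter step sizes; a common default for large problems
adam_model = build_model()
adam_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                   loss='binary_crossentropy')
adam_model.fit(X, y, batch_size=32, epochs=10, verbose=0)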