-
Key difference between Adam and SGD is that Adam doesn’t apply the raw gradient directly → it instead works with the partial derivative of the cost function with respect to each individual parameter
- So it really combines gradient information from different locations in parameter space (partial derivatives computed at earlier iterates), kept separately for each parameter
- It also uses adaptive learning rates for each parameter → and furthermore, uses moments (exponential moving averages) of the gradients as opposed to raw gradients or sums of gradients (see the sketch after this list)
- Also, Adam ACTUALLY PERFORMS a form of STEP SIZE ANNEALING → the effective step size shrinks on its own as the first moment becomes small relative to the second, so no hand-tuned decay schedule is needed
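A minimal NumPy sketch (my own illustration, not from the original notes) of what the per-parameter behaviour looks like → one plain SGD step next to one Adam-style step, with made-up gradient values and common default hyperparameters:

```python
import numpy as np

# Hypothetical gradient for a 3-parameter model (values chosen only for illustration)
g = np.array([0.5, -0.01, 3.0])
theta = np.zeros(3)
a = 0.001                                # learning rate / step size

# SGD: the raw gradient sets both the direction AND the magnitude of each step
theta_sgd = theta - a * g

# Adam-style step: each parameter j gets its own effective step size, driven by
# moment estimates of its partial derivative instead of the raw gradient
m = 0.9 * np.zeros(3) + 0.1 * g          # first moment: EMA of the gradients (one step from zero init)
v = 0.999 * np.zeros(3) + 0.001 * g**2   # second raw moment: EMA of the squared gradients
theta_adam = theta - a * m / (np.sqrt(v) + 1e-8)

print(theta_sgd)    # per-parameter steps proportional to |g_j| → very unequal
print(theta_adam)   # per-parameter steps of similar size despite very different |g_j|
```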
-
Formula for Adam optimization
$$
\Delta_j^{(t)} = \frac{-a}{\sqrt{v_j^{(t)}} + \epsilon}\, m_j^{(t)}
$$
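For reference → in the standard Adam formulation (Kingma & Ba), the two moments are themselves built up as exponential moving averages of $g^{(t)}_j$, the partial derivative of the cost with respect to the $j^{th}$ parameter at iteration $t$, with decay hyperparameters $\beta_1 \approx 0.9$ and $\beta_2 \approx 0.999$:
$$
m_j^{(t)} = \beta_1\, m_j^{(t-1)} + (1-\beta_1)\, g_j^{(t)},
\qquad
v_j^{(t)} = \beta_2\, v_j^{(t-1)} + (1-\beta_2)\, \big(g_j^{(t)}\big)^2
$$
(The published version also divides each moment by $1-\beta^t$ as a bias correction for the zero initialization; the step formula above uses the moments directly.)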
-
Terms
- $a$ = learning rate/stepsize
- $Δ^{(t)}_j$ is the step taken by Adam in the $t^{th}$ iteration, along the dimension of (i.e., with respect to) the $j^{th}$ parameter
- $ε$ is there to prevent division by zero
- $m^{(t)}_j$ is the FIRST MOMENT → the exponential moving average of the partial derivatives across iterations $t$
- More accurately → it’s the first moment (an exponentially weighted mean) of the partial derivatives seen so far
- $v^{(t)}_j$ → this is the second RAW moment (the variance plus the SQUARED mean!)
- In other words, it’s the mean squared distance from the origin
- When we take its square root, we get the root mean square of the derivatives (an uncentered standard deviation)
- This is really the exponential moving average of the SQUARES of the derivatives
- Note that THE GRADIENTS THEMSELVES ARE NOT USED
- Instead, we’re using the first and second moments of the gradients to make the updates (see the sketch after this list)
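Putting the pieces together → a minimal sketch of one full Adam step in NumPy. The hyperparameter defaults are the usual ones from the paper; `grad_fn` is a hypothetical placeholder for whatever computes the partial derivatives of your cost.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, a=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: only the two moments of the gradient enter the step,
    never the raw gradient itself, and every parameter gets its own scale."""
    # First moment: exponential moving average of the partial derivatives
    m = beta1 * m + (1 - beta1) * grad
    # Second raw moment: exponential moving average of the SQUARED partial derivatives
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction from the original paper (counteracts the zero initialization of m and v)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # The step formula: Delta_j = -a * m_j / (sqrt(v_j) + eps)
    theta = theta - a * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage sketch (grad_fn is hypothetical):
theta = np.zeros(3)
m, v = np.zeros_like(theta), np.zeros_like(theta)
# for t in range(1, 1001):
#     theta, m, v = adam_step(theta, grad_fn(theta), m, v, t)
```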
-
One key benefit is that it’s invariant to diagonal gradient scaling
- Let’s say we rescale the gradients with a constant (positive) diagonal matrix $C$ → then the first moment picks up a factor of $C$, and so does the square root of the second moment
- Since epsilon is basically zero ($10^{-8}$), the two $C$s cancel → while the gradients themselves are rescaled, the STEP SIZE REMAINS NEARLY IDENTICAL! (a quick numeric check follows this list)
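A quick numeric check of this cancellation (my own illustration, using one-step moments from a zero initialization and made-up positive scaling factors):

```python
import numpy as np

g = np.array([0.5, -0.01, 3.0])      # hypothetical gradients
c = np.array([10.0, 100.0, 0.1])     # positive diagonal of the scaling matrix C
a, eps = 0.001, 1e-8

def step(grad):
    m = 0.1 * grad                   # first moment after one step from zero init
    v = 0.001 * grad ** 2            # second raw moment after one step from zero init
    return -a * m / (np.sqrt(v) + eps)

print(step(g))       # unscaled step
print(step(c * g))   # scaled step: m picks up a factor c, sqrt(v) picks up |c| = c,
                     # so the factors cancel and the step is (nearly) identical
```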
-
Another benefit is that the magnitude of each step is (approximately) bounded by the learning rate → the ratio of the first moment to the square root of the second raw moment has magnitude of at most about 1, so $|\Delta^{(t)}_j| \lesssim a$ (spelled out below)
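Spelling the bound out (ignoring $\epsilon$): because a second raw moment is at least the square of the mean, $\sqrt{v^{(t)}_j}$ is roughly at least $|m^{(t)}_j|$ (exactly so when both moving averages use the same decay rate), which gives
$$
\big|\Delta_j^{(t)}\big| = \frac{a\,\big|m_j^{(t)}\big|}{\sqrt{v_j^{(t)}} + \epsilon}
\;\lesssim\; \frac{a\,\big|m_j^{(t)}\big|}{\big|m_j^{(t)}\big|} = a
$$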
-
Furthermore, the objective DOES NOT NEED TO BE STATIONARY (its gradient statistics don’t have to stay the same over time)
- The gradients can be noisy, almost random from one step to the next, and this optimizer will still work