Key difference between Adam and SGD is that Adam doesn’t apply the raw gradient directly → it instead keeps running estimates of each parameter’s gradient (the partial derivative of the cost function with respect to that parameter) and uses them to adapt the step size per parameter
Formula for Adam optimization
$$ \Delta_j^{(t)} = \frac{-\alpha}{\sqrt{\hat{v}_j^{(t)}} + \epsilon}\,\hat{m}_j^{(t)} $$
Terms
$\alpha$ → the learning rate (step size)
$\hat{m}_j^{(t)}$ → the bias-corrected estimate of the first moment (mean) of the gradient for parameter $j$ at timestep $t$
$\hat{v}_j^{(t)}$ → the bias-corrected estimate of the second moment (uncentered variance) of the gradient for parameter $j$
$\epsilon$ → a small constant added for numerical stability
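As a concrete illustration of the update above, here is a minimal NumPy sketch of a single Adam step; the function and variable names (`adam_step`, `m_hat`, `v_hat`, etc.) are just illustrative, not from any particular library.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter vector `theta` given gradient `grad` at timestep t (1-indexed)."""
    # Exponentially decayed estimates of the first and second moments of the gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction: counteracts the zero initialization of m and v
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # The per-parameter update Delta^(t) from the formula above
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

Here `m` and `v` start as `np.zeros_like(theta)` and are carried between calls, with `t` counting steps starting from 1.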
One key benefit is that the magnitude of its updates is invariant to diagonal rescaling of the gradient (see the identity after these points)
Another benefit is that the effective step size is (approximately) bounded by the learning rate hyperparameter (the exact bounds are worked out below)
Furthermore, the objective DOES NOT NEED TO BE STATIONARY (its distribution can change over time rather than staying fixed)
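A quick sanity check of the diagonal-rescaling invariance mentioned above (ignoring $\epsilon$): if every gradient is rescaled by a constant factor $c > 0$, then $\hat{m}$ is scaled by $c$ and $\hat{v}$ by $c^2$, so the factor cancels out of the update:

$$ \frac{\alpha \, c\,\hat{m}_j^{(t)}}{\sqrt{c^2\,\hat{v}_j^{(t)}}} = \frac{\alpha\,\hat{m}_j^{(t)}}{\sqrt{\hat{v}_j^{(t)}}} $$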
We’re really attempting to minimize the expected value of the cost function (the mean)
Realization of a function → if we have a random variable $x$, then the different random values that it takes on are known as REALIZATIONS
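Tying the last two notes together (this is just the standard stochastic-optimization setup, written in the Adam paper's notation where $f_1(\theta), \dots, f_T(\theta)$ are the realizations of the noisy objective at successive timesteps):

$$ \theta^{*} = \arg\min_{\theta}\; \mathbb{E}\big[f(\theta)\big] \approx \arg\min_{\theta}\; \frac{1}{T}\sum_{t=1}^{T} f_t(\theta) $$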
Basically, $\beta_1$ and $\beta_2$ control the exponential DECAY RATES of the two moving averages (the first- and second-moment estimates)
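Written out (per parameter $j$, where $g_j^{(t)}$ is the gradient at timestep $t$), the two exponentially decaying averages and their bias corrections are:

$$ m_j^{(t)} = \beta_1\, m_j^{(t-1)} + (1-\beta_1)\, g_j^{(t)}, \qquad v_j^{(t)} = \beta_2\, v_j^{(t-1)} + (1-\beta_2)\, \big(g_j^{(t)}\big)^2 $$

$$ \hat{m}_j^{(t)} = \frac{m_j^{(t)}}{1-\beta_1^{\,t}}, \qquad \hat{v}_j^{(t)} = \frac{v_j^{(t)}}{1-\beta_2^{\,t}} $$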
Efficiency of the end algorithm can be improved by folding the last few steps of the loop (bias correction + parameter update) into one → simply rearrange and simplify into the following:
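Specifically (this is the folding suggested in the Adam paper), the bias corrections get absorbed into a per-step learning rate $\alpha_t$, with $\hat{\epsilon}$ being $\epsilon$ rescaled accordingly:

$$ \alpha_t = \alpha \cdot \frac{\sqrt{1-\beta_2^{\,t}}}{1-\beta_1^{\,t}}, \qquad \theta^{(t)} = \theta^{(t-1)} - \alpha_t \, \frac{m^{(t)}}{\sqrt{v^{(t)}} + \hat{\epsilon}} $$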
The step size has two upper bounds depending on the condition
The first bound → when $1-\beta_1 > \sqrt{1-\beta_2}$ (when $\beta_1$ is further from 1 than the square root of $\beta_2$'s distance from 1), the bound becomes $\alpha(1-\beta_1)/\sqrt{1-\beta_2}$
This bound is only approached when the gradient has been extremely sparse → zero at all timesteps with the exception of the current one
The reason for this is that sparsity makes both bias-corrected moment estimates small → but, because we take the square root of the second estimate, the first estimate ends up larger than $\sqrt{\hat{v}}$, so the ratio $\hat{m}/\sqrt{\hat{v}}$ exceeds 1 and the step magnitude approaches the bound above (assuming $\epsilon$ is zero or extremely close to it)
Proof of this:
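A sketch of that argument (assuming $\epsilon = 0$ and a gradient $g$ that is zero at every timestep except the current one $t$):

$$ \hat{m}^{(t)} = \frac{(1-\beta_1)\,g}{1-\beta_1^{\,t}}, \qquad \hat{v}^{(t)} = \frac{(1-\beta_2)\,g^{2}}{1-\beta_2^{\,t}} $$

$$ \big|\Delta^{(t)}\big| = \alpha\,\frac{\big|\hat{m}^{(t)}\big|}{\sqrt{\hat{v}^{(t)}}} = \frac{\alpha\,(1-\beta_1)}{\sqrt{1-\beta_2}} \cdot \frac{\sqrt{1-\beta_2^{\,t}}}{1-\beta_1^{\,t}} $$

The bias-correction factor on the right approaches 1 as $t$ grows, so in this maximally sparse case the step magnitude approaches (and is approximately bounded by) $\alpha(1-\beta_1)/\sqrt{1-\beta_2}$.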
The second bound (the case $1-\beta_1 \le \sqrt{1-\beta_2}$) is simply the learning rate $\alpha$ itself!
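Putting the two cases together, as stated in the Adam paper (the $\lesssim$ reflects that these are approximate bounds):

$$ \big|\Delta^{(t)}\big| \;\lesssim\; \begin{cases} \alpha \cdot \dfrac{1-\beta_1}{\sqrt{1-\beta_2}} & \text{if } 1-\beta_1 > \sqrt{1-\beta_2} \\[1ex] \alpha & \text{otherwise} \end{cases} $$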