- 
Key difference between Adam and SGD is that Adam doesn’t use the raw gradients directly → it instead works with the partial derivative of the cost function with respect to each individual parameter 
- So it really treats the gradient coordinate-by-coordinate (one partial derivative per parameter, handled separately in each dimension of parameter space)
- It also uses an adaptive learning rate for each parameter → and furthermore, uses moments of the gradients as opposed to sums of gradients
- Also, Adam ACTUALLY PERFORMS STEP SIZE ANNEALING → the effective step size shrinks automatically as the first moment becomes small relative to the square root of the second moment, so no separate decay schedule is needed
 
- 
Formula for Adam optimization 
$$
Δ_j^{(t)}=\frac{-a}{\sqrt{v^{Δ(t)}_j}+ε}\cdot m^{Δ(t)}_j
$$ 
- 
Terms 
- $a$ = learning rate/stepsize
- $Δ^{(t)}_j$ is the step taken by Adam in the $t^{th}$ iteration, along the dimension of (i.e., with respect to) the $j^{th}$ parameter
- $ε$ is there to prevent division by zero
- $m^{Δ(t)}_j$ is the FIRST MOMENT → the exponential moving average of the partial derivatives across $t$
- More accurately → it’s an estimate of the first moment (mean) of the partial derivative for that parameter
 
- $v^{Δ(t)}_j$ → this is the second RAW moment (the variance + the squared mean!)
- In other words, it’s the mean squared distance from zero (rather than from the mean)
- When we square root this, we get the uncentered standard deviation (the RMS of the partial derivatives)
- This is really the exponential moving average of the SQUARES of the partial derivatives
 
- Note that THE RAW GRADIENTS THEMSELVES ARE NOT USED directly in the update
- Instead, we’re using the first and second moments of the gradients to make the updates (see the sketch below)
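- To make this concrete, here’s a minimal NumPy sketch of one Adam step following the formula above → the decay rates `beta1`/`beta2`, the bias-correction step, and the name `adam_step` are not in these notes (they’re the usual defaults from the Adam paper plus an illustrative name)

```python
import numpy as np

def adam_step(theta, grad, m, v, t, a=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a parameter vector theta (illustrative sketch).

    m, v hold the exponential moving averages of the partial derivatives
    and of their squares (first and second raw moments); t starts at 1.
    """
    # First moment: exponential moving average of the partial derivatives.
    m = beta1 * m + (1 - beta1) * grad
    # Second raw moment: exponential moving average of the squared partials.
    v = beta2 * v + (1 - beta2) * grad ** 2

    # Bias correction (part of standard Adam; the simplified formula above omits it).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    # Per-parameter step: Δ_j^(t) = -a / (sqrt(v_j) + eps) * m_j
    step = -a / (np.sqrt(v_hat) + eps) * m_hat
    return theta + step, m, v


# Tiny usage example on f(θ) = θ₁² + θ₂², whose gradient is 2θ.
theta = np.array([1.0, -2.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 5001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
print(theta)  # both coordinates end up near 0
```

- Note how the update only ever sees `m` and `v` → the raw `grad` values just feed the moving averages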
 
 
- 
One key benefit is that it’s invariant to diagonal gradient scaling 
- Let’s say we rescale the gradients by a constant diagonal matrix $C$ (each partial derivative gets multiplied by a positive constant)
- $C$ then scales both the first moment and the square root of the second moment, and since epsilon is essentially zero ($10^{-8}$), the two $C$s cancel → the gradients are affected, but the STEP SIZE REMAINS NEARLY IDENTICAL! (see the check below)
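- A quick numerical check of this claim, reusing the hypothetical `adam_step` sketch from above (the gradient values and the constant `c` are made up for illustration)

```python
import numpy as np  # adam_step from the sketch above is assumed to be in scope

theta = np.array([1.0, -2.0, 0.5])
grad = np.array([0.3, -0.7, 5.0])   # some partial derivatives
c = 1000.0                          # arbitrary positive scaling constant

zeros = np.zeros_like(theta)
theta_plain,  _, _ = adam_step(theta, grad,     zeros.copy(), zeros.copy(), t=1)
theta_scaled, _, _ = adam_step(theta, c * grad, zeros.copy(), zeros.copy(), t=1)

# c appears in both the first moment (numerator) and the square root of the
# second moment (denominator), so it cancels up to the tiny eps term:
print(theta_plain - theta)    # per-parameter steps
print(theta_scaled - theta)   # nearly identical despite 1000x larger gradients
```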
 
- 
Another benefit is that the magnitude of each step is (approximately) bounded by the learning rate → $|m^{Δ(t)}_j| / \sqrt{v^{Δ(t)}_j}$ is roughly at most 1, so $|Δ^{(t)}_j|$ is roughly at most $a$ (see the check below) 
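- A small check of this bound (same assumed `adam_step` as above): no matter how large or small the partial derivatives are, each per-parameter step stays on the order of $a$

```python
import numpy as np  # adam_step from the sketch above is assumed to be in scope

a = 0.001
theta = np.zeros(3)
m, v = np.zeros(3), np.zeros(3)
grad = np.array([1e6, 3.0, 1e-6])   # wildly different gradient scales

for t in range(1, 6):
    new_theta, m, v = adam_step(theta, grad, m, v, t, a=a)
    # largest per-parameter step magnitude ≈ a = 0.001 at every iteration
    print(np.abs(new_theta - theta).max())
    theta = new_theta
```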
- 
Furthermore, the objective DOES NOT NEED TO BE STATIONARY (its distribution doesn’t have to stay the same over time) 
- The objective and its gradients can drift or be noisy over time, and this optimizer will still work (see the toy example below)
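- A toy illustration of this, again using the assumed `adam_step` sketch from above (the drift schedule is entirely made up): the minimum of a 1-D quadratic moves every iteration and Adam keeps tracking it

```python
import numpy as np  # adam_step from the sketch above is assumed to be in scope

theta = np.array([0.0])
m, v = np.zeros(1), np.zeros(1)

for t in range(1, 5001):
    target = 0.001 * t                 # the optimum itself drifts over time
    grad = 2 * (theta - target)        # gradient of f_t(θ) = (θ - target)²
    theta, m, v = adam_step(theta, grad, m, v, t, a=0.01)

print(theta, target)  # θ tracks the drifting target rather than the t=1 optimum
```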