• Understanding SGD
    • Used in many fields of engineering → fundamental purpose is to maximize or minimize some objective (loss) function with respect to parameters
      • These loss functions are often stochastic → each evaluation is really a subfunction of the full objective, computed on a different subsample (minibatch) of the data at different points in the training cycle
      • So, by using the GRADIENTS, we can take “steps” within these subfunctions → these functions can also be random due to things like dropout (increasing variance)
      • For noisy objectives, more efficient optimization techniques are needed → consider HIGHER DIMENSIONAL PARAMETERS.
        • For instance, if we have a large parameter space, there are more inputs and features → which obviously increases the complexity of the model
        • In these cases, optimization methods that compute higher-order derivatives (e.g., the Hessian) simply do not scale well → they are prohibitively expensive in both computation and time
        • As a result, the challenge is to find a way to EFFICIENTLY optimize networks with HIGHER DIMENSIONS by calculating ONLY FIRST ORDER DERIVATIVES → THIS IS WHAT ADAM DOES.
    • What is Adam?
      • Adam is really a variant of SGD that aims to arrive at the same (or better) results in a shorter period of time → or, better results given more training
      • The algorithm only requires computing first-order gradients → therefore being easy on memory. It additionally computes individual, ADAPTIVE LEARNING RATES for different parameters and does this by estimating gradient “moments”
        • It uses raw first and second moments → that is, both moments are calculated about the origin, without subtracting the mean from the values of $x$ first
        • More accurately, the moments are computed with respect to the gradients
          • So, a) the mean value of the gradients and b) the uncentered variance (second raw moment) of the gradients are estimated
          • What is the second raw moment?
            • We know that the second raw moment is the same thing as $E[(X-0)^2]$ (no subtracting the mean)
            • We also know that variance is $E[(X-E[X])^2]$ → the first moment is the mean, so we’re just subtracting the mean and then squaring
              • If we simply expand this, we’ll end up with $E[X^2]-E[X]^2$ AS THE VARIANCE!
            • We know that the second raw moment when simplified is the same thing as $E[X^2]$ → which is the same thing as $Var(X)+E[X]^2$
            • So, THE SECOND RAW MOMENT IS JUST THE VARIANCE PLUS THE SQUARE OF THE EXPECTED VALUE (THE MEAN)!
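            • Written out as a single line (just the standard identity above restated, nothing new assumed):

              $$ Var(X)=E[(X-E[X])^2]=E[X^2]-E[X]^2 \;\;\Rightarrow\;\; E[X^2]=Var(X)+E[X]^2 $$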
      • Meant to take advantage of the benefits of AdaGrad and RMSProp → the first works well with sparse gradients (gradients that aren’t “strong” enough and contain mostly zero values) and the second works well in online and non-stationary settings
        • Basically → a stationary setting is one where the statistics of the objective (variance and correlation structure) don’t change over time
          • So, the gradient distribution doesn’t suddenly drift, become periodic, or something similar
        • Careful: “stationary” here does NOT mean stationary POINTS (points where the derivative is zero) → it refers to the objective’s statistics staying fixed, and RMSProp is the one that works well when they don’t
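        • For concreteness, the two update rules being contrasted (standard textbook forms of AdaGrad and RMSProp, written here for reference rather than taken from these notes):

          $$ \text{AdaGrad: } v_t=v_{t-1}+g_t^2, \qquad θ_t=θ_{t-1}-\frac{a}{\sqrt{v_t}+ε}\,g_t $$

          $$ \text{RMSProp: } v_t=ρ\,v_{t-1}+(1-ρ)\,g_t^2, \qquad θ_t=θ_{t-1}-\frac{a}{\sqrt{v_t}+ε}\,g_t $$

        • RMSProp’s decaying average forgets old gradients, which is why it copes with drifting statistics, while AdaGrad’s ever-growing sum keeps shrinking the effective step → great for sparse gradients, but a problem over long runs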
    • Adam is invariant to DIAGONAL RESCALING
      • Key difference between Adam and plain SGD is that Adam doesn’t apply the raw gradient directly as the update → it instead works with the partial derivative of the cost function with respect to each individual parameter

        • So, it really treats the gradient per coordinate of parameter space (a separate partial derivative, and a separate update scale, for each parameter)
        • It also uses adaptable learning rates for each parameter → and furthermore, uses moments as opposed to sums of gradients
        • Also, Adam EFFECTIVELY PERFORMS STEP SIZE ANNEALING → the effective step size $a\,m^{Δ}_t/\sqrt{v^{Δ}_t}$ automatically shrinks when the gradient signal-to-noise ratio drops (e.g., near an optimum), so no separate annealing schedule is strictly needed
      • Formula for Adam optimization

        $$ Δ_j^{(t)}=\frac{-a}{\sqrt{v^{Δ(t)}_j}+ε}\, m^{Δ(t)}_j $$

      • Terms

        • $a$ = learning rate/stepsize
        • $Δ^{(t)}_j$ is the step taken by Adam in the $t^{th}$ iteration, in the dimension (with respect to) the $j^{th}$ parameter
        • $ε$ is there to prevent division by zero
        • $m^{Δ(t)}_j$ is the FIRST MOMENT → the exponential moving average of the partial derivatives across $t$
          • More accurately → it’s the first moment (mean) of all the partial derivatives
        • $v^{Δ(t)}_j$ → this is the (bias-corrected estimate of the) second RAW moment (the variance + the squared mean!)
          • In other words, it’s the mean SQUARED distance from the origin
          • When we take the square root of this, we get the uncentered standard deviation (an RMS of the gradients)
          • This is really the exponential moving average of the SQUARES of the derivatives
        • Note that THE GRADIENTS THEMSELVES ARE NOT USED
          • Instead, we’re using the first and second moments of the gradients to make the updates
      • One key benefit is that it’s invariant to diagonal gradient scaling

        • Let’s say that we rescale the gradients by a constant diagonal matrix $C$ (each coordinate of the gradient gets multiplied by a fixed constant)
        • Since epsilon is basically zero ($10^{-8}$), the factor of $C$ picked up by $m$ cancels against the factor picked up by $\sqrt{v}$ → while the gradients are rescaled, the STEP SIZE REMAINS NEARLY IDENTICAL!
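        • Per coordinate, writing $c$ for the corresponding diagonal entry of $C$ (and ignoring $ε$), the cancellation looks like this:

          $$ \tilde g_t=c\,g_t \;\Rightarrow\; \tilde m_t=c\,m_t,\;\; \tilde v_t=c^2 v_t \;\Rightarrow\; \frac{\tilde m_t}{\sqrt{\tilde v_t}}=\frac{c\,m_t}{|c|\sqrt{v_t}}=\frac{m_t}{\sqrt{v_t}} \text{ (for } c>0\text{)} $$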
      • Another benefit is that the magnitude of each step is (approximately) bounded by the learning rate → the update is the learning rate times the ratio of the corrected first moment to the root of the corrected second moment, and that ratio is roughly at most 1 in magnitude (sketched below)
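        • Roughly, since $|E[g]| \le \sqrt{E[g^2]}$:

          $$ |Δ^{(t)}_j| = a\,\frac{|m^{Δ(t)}_j|}{\sqrt{v^{Δ(t)}_j}+ε} \lesssim a $$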

      • Furthermore, the objective DOES NOT NEED TO BE STATIONARY (its statistics don’t have to stay identical, keep the same variance, or repeat over time)

        • The gradient statistics can drift almost arbitrarily over training and this optimizer will still work
    • Algorithm in detail
      • Steps:
        • Requirements → stepsize $a$, decay rates $B_1,B_2$ with $0 \le B_1,B_2 < 1$, $θ_0$ (initial params), and $f(θ)$ (stochastic cost function of the parameters), in addition to epsilon $ε$
        • $m_0=0,v_0=0,t=0$ (initialize moments and timesteps)
        • While not converged
          • $t=t+1$
          • $g_t=∇f_t(θ_{t-1})$ → take the gradient of the cost function at the previous set of parameters
          • $m_t=B_1*m_{t-1}+(1-B_1)*g_t$ → find the first moment (estimating first moment)
            • The weights $B_1$ and $(1-B_1)$ add up to one, so the result stays on the same scale as a single moment of the gradient → it’s a blend of the old estimate and the new gradient
            • (In these notes, $m^{Δ}_t$ denotes the bias-corrected version of this estimate, defined a few lines below → same convention for the second moment)
          • $v_t=B_2 v_{t-1}+(1-B_2)g_t^2$
            • Find the new second-moment estimate → the new term being averaged in is the element-wise square $g_t^2$
            • REMEMBER → $g$ IS A VECTOR
              • So, every coordinate keeps its own running estimate, updated with the same weighted-sum rule
              • The weights $(1-B_2)B_2^{\,t-k}$ over past timesteps act like a probability distribution (they sum to roughly one) → the weighted average of the squared gradients under those weights GIVES US AN ESTIMATE OF THE SECOND MOMENT! (see the unrolled form below)
              • For the first moment, this was a weighted mean of the gradients themselves → here it is a weighted mean of the SQUARED gradients
              • So → THIS ENTIRE LINE IS ESTIMATING A MOMENT
                • We’re taking the (decayed) estimate built from all previous gradients and blending it with the newest one
                • We’re basically estimating the current moment from an exponentially weighted average of PAST GRADIENTS, rather than storing them all
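              • Unrolling the recursion makes the “weighted sum over past gradients” explicit (same algebra as above, just expanded):

                $$ v_t=(1-B_2)\sum_{k=1}^{t} B_2^{\,t-k}\,g_k^2, \qquad \sum_{k=1}^{t}(1-B_2)B_2^{\,t-k}=1-B_2^{\,t} $$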
            • $m^{Δ}_t=m_t/(1-B_1^t)$ → update estimate while correcting for bias
            • Bias correction and why it’s important
              • Basically → normally, $B_2$ is closer to one than $B_1$, meaning that the second moment does not change as much as the first moment
              • So, in the first few steps, where the estimates are still dominated by their zero initialization, the uncorrected update would be huge.
                • It’s simple → $m_1=(1-B_1)g_1$ while $v_1=(1-B_2)g_1^2$; because $(1-B_2)$ is much smaller than $(1-B_1)$, $v_1$ is shrunk toward zero far more than $m_1$
                • As the formula takes the square root of $v_t$ AND THEN DIVIDES by it, the update would be extremely large for the first couple of steps while $v_t$ is still tiny
              • On the other hand, if we just used the raw gradients as the updates directly, this initialization bias wouldn’t matter → the problem only appears because we divide one biased estimate by another
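              • A quick worked check at $t=1$, using the commonly quoted defaults $B_1=0.9$, $B_2=0.999$ (defaults assumed here, not stated in these notes):

                $$ m_1=0.1\,g_1,\;\; v_1=0.001\,g_1^2 \;\;\Rightarrow\;\; m^{Δ}_1=\frac{0.1\,g_1}{1-0.9}=g_1,\;\; v^{Δ}_1=\frac{0.001\,g_1^2}{1-0.999}=g_1^2 $$

              • So after correction the very first step is roughly $-a\,g_1/(|g_1|+ε)$, i.e., about $\pm a$ per coordinate, instead of being blown up by the tiny uncorrected $v_1$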
            • Note that the optimizer uses the moving average to estimate the first and second moments
              • THE MOVING AVERAGE IS THE EXACT SAME EQUATION AS $m_t$ and $v_t$!
              • We’re incorporating the weighted average of the new and old points together
              • Why can this be used to estimate a moment?
                • Simple → it’s a WEIGHTED SUM!
                • This mirrors the definition of a moment as a weighted expectation → recall that the first term already folds in the (decayed) weighted sum of all the previous gradients
                • Therefore, since we aren’t keeping the exact values of the previous gradients around, this becomes an estimate of what the moments are at that particular time
            • NOTE → THE BIAS TERMS ARE ACTIVELY DECAYING
              • The terms $B_1^t$ and $B_2^t$ get smaller and smaller as they are raised to the power $t$
              • As a result, over a long period of time, the correction denominators $(1-B^t)$ approach 1 → so the bias correction matters a lot early on and becomes negligible later
          • $v^{Δ}_t=v_t/(1-B_2^t)$ → same process but for the second moment
          • $θ_t=θ_{t-1} - \frac{a}{\sqrt{v^{Δ}_t}+ε}\, m^{Δ}_t$ → update parameters (element-wise)
            • We take the square root of the second moment because otherwise we’d be dividing by the (average of the) SQUARED GRADIENTS → the scale wouldn’t match the first moment in the numerator, distorting the updates
            • Instead, the square root of the bias-corrected second-moment estimate gives a more realistic scale to divide by (almost a standard deviation!)
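      • A minimal NumPy sketch of the loop above, assuming a user-supplied `grad_fn(theta)` that returns the stochastic gradient $∇f_t(θ)$; the function/argument names and the defaults ($a=0.001$, $B_1=0.9$, $B_2=0.999$, $ε=10^{-8}$) are just the commonly used ones for illustration, not values fixed by these notes:

        ```python
        import numpy as np

        def adam(grad_fn, theta0, alpha=0.001, beta1=0.9, beta2=0.999,
                 eps=1e-8, num_steps=1000):
            """Sketch of the Adam loop described above (not a reference implementation)."""
            theta = np.asarray(theta0, dtype=float).copy()
            m = np.zeros_like(theta)  # first-moment estimate, m_0 = 0
            v = np.zeros_like(theta)  # second raw-moment estimate, v_0 = 0

            for t in range(1, num_steps + 1):
                g = grad_fn(theta)                   # g_t = grad of f_t at theta_{t-1}
                m = beta1 * m + (1 - beta1) * g      # EMA of gradients
                v = beta2 * v + (1 - beta2) * g * g  # EMA of squared gradients
                m_hat = m / (1 - beta1 ** t)         # bias-corrected first moment
                v_hat = v / (1 - beta2 ** t)         # bias-corrected second moment
                theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)  # parameter update
            return theta

        # Tiny usage check: minimize f(x) = sum(x^2), whose gradient is 2x.
        x_min = adam(lambda x: 2 * x, theta0=np.array([3.0, -4.0]),
                     alpha=0.01, num_steps=5000)
        print(x_min)  # should land near [0, 0], to within roughly the step size
        ```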
  • Information about the algorithm
  • Bias Initializations
  • Analyzing Convergence
  • Key Terms:
  • PRACTICAL IMPLEMENTATION NOTES