• Understanding SGD
    • Used in many fields of engineering → fundamental purpose is to maximize or minimize some objective (loss) function with respect to parameters
      • These loss functions are often stochastic → each evaluation is really a subfunction of the full objective, computed on a different subsample (minibatch) of the data at different points in the training cycle
      • So, by using the GRADIENTS, we can take “steps” within these subfunctions → these functions can also be random due to things like dropout (increasing variance)
      • For noisy objectives, more efficient optimization techniques are needed → consider HIGHER DIMENSIONAL PARAMETERS.
        • For instance, if we have a large parameter space, there are more inputs and features → which obviously increases the complexity of the model
        • In these cases, optimization methods that compute higher-order derivatives (e.g., the Hessian) simply do not scale well → they are prohibitively expensive in both computation and time
        • As a result, the challenge is to find a way to EFFICIENTLY optimize networks with HIGHER DIMENSIONS by calculating ONLY FIRST ORDER DERIVATIVES → THIS IS WHAT ADAM DOES.
    • What is Adam?
      • Adam is really a variant of SGD that aims to arrive at the same (or better) results in a shorter period of time → or, better results given more training
      • The algorithm only requires computing first-order gradients → therefore being easy on memory. It additionally computes individual, ADAPTIVE LEARNING RATES for different parameters and does this by estimating gradient “moments”
        • It uses raw first and second moments → that is, both moments are calculated about the origin, without subtracting the mean from the values of $x$ first
        • More accurately, the moments are computed with respect to the gradients
          • So, a) the mean value of the gradients and b) the uncentered variance (second raw moment) of the gradients are estimated
          • What is the second raw moment?
            • We know that the second raw moment is the same thing as $E[(X-0)^2]$ (no subtracting the mean)
            • We also know that variance is $E[(X-E[X])^2]$ → the first moment is the mean, so we’re just subtracting the mean and then squaring
              • If we simply expand this, we’ll end up with $E[X^2]-E[X]^2$ AS THE VARIANCE!
            • We know that the second raw moment when simplified is the same thing as $E[X^2]$ → which is the same thing as $Var(X)+E[X]^2$
            • So, THE SECOND RAW MOMENT IS JUST THE VARIANCE PLUS THE SQUARE OF THE EXPECTED VALUE (THE MEAN)!
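            • Written out as a single line (just the standard identity above restated, nothing new assumed):

              $$ Var(X)=E[(X-E[X])^2]=E[X^2]-E[X]^2 \;\;\Rightarrow\;\; E[X^2]=Var(X)+E[X]^2 $$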
      • Meant to take advantage of the benefits of AdaGrad and RMSProp → the first works well with sparse gradients (gradients that aren’t “strong” enough and contain mostly zero values) and the second works well in online and non-stationary settings
        • Basically → a stationary setting is one where the statistics of the objective (variance and correlation structure) don’t change over time
          • So, the gradient distribution doesn’t suddenly drift, become periodic, or something similar
        • Careful: “stationary” here does NOT mean stationary POINTS (points where the derivative is zero) → it refers to the objective’s statistics staying fixed, and RMSProp is the one that works well when they don’t
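        • For concreteness, the two update rules being contrasted (standard textbook forms of AdaGrad and RMSProp, written here for reference rather than taken from these notes):

          $$ \text{AdaGrad: } v_t=v_{t-1}+g_t^2, \qquad θ_t=θ_{t-1}-\frac{a}{\sqrt{v_t}+ε}\,g_t $$

          $$ \text{RMSProp: } v_t=ρ\,v_{t-1}+(1-ρ)\,g_t^2, \qquad θ_t=θ_{t-1}-\frac{a}{\sqrt{v_t}+ε}\,g_t $$

        • RMSProp’s decaying average forgets old gradients, which is why it copes with drifting statistics, while AdaGrad’s ever-growing sum keeps shrinking the effective step → great for sparse gradients, but a problem over long runs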
    • Adam is invariant to DIAGONAL RESCALING
      • Key difference between Adam and plain SGD is that Adam doesn’t apply the raw gradient directly as the update → it instead works with the partial derivative of the cost function with respect to each individual parameter

        • So, it really treats the gradient per coordinate of parameter space (a separate partial derivative, and a separate update scale, for each parameter)
        • It also uses adaptable learning rates for each parameter → and furthermore, uses moments as opposed to sums of gradients
        • Also, Adam EFFECTIVELY PERFORMS STEP SIZE ANNEALING → the effective step size $a\,m^{Δ}_t/\sqrt{v^{Δ}_t}$ automatically shrinks when the gradient signal-to-noise ratio drops (e.g., near an optimum), so no separate annealing schedule is strictly needed
      • Formula for Adam optimization

        $$ Δ_j^{(t)}=\frac{-a}{\sqrt{v^{Δ(t)}_j}+ε}\, m^{Δ(t)}_j $$

      • Terms

        • $a$ = learning rate/stepsize
        • $Δ^{(t)}_j$ is the step taken by Adam in the $t^{th}$ iteration, in the dimension (with respect to) the $j^{th}$ parameter
        • $ε$ is there to prevent division by zero
        • $m^{Δ(t)}_j$ is the FIRST MOMENT → the exponential moving average of the partial derivatives across $t$
          • More accurately → it’s the first moment (mean) of all the partial derivatives
        • $v^{Δ(t)}_j$ → this is the (bias-corrected estimate of the) second RAW moment (the variance + the squared mean!)
          • In other words, it’s the mean SQUARED distance from the origin
          • When we take the square root of this, we get the uncentered standard deviation (an RMS of the gradients)
          • This is really the exponential moving average of the SQUARES of the derivatives
        • Note that THE GRADIENTS THEMSELVES ARE NOT USED
          • Instead, we’re using the first and second moments of the gradients to make the updates
      • One key benefit is that it’s invariant to diagonal gradient scaling

        • Let’s say that we rescale the gradients by a constant diagonal matrix $C$ (each coordinate of the gradient gets multiplied by a fixed constant)
        • Since epsilon is basically zero ($10^{-8}$), the factor of $C$ picked up by $m$ cancels against the factor picked up by $\sqrt{v}$ → while the gradients are rescaled, the STEP SIZE REMAINS NEARLY IDENTICAL!
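        • Per coordinate, writing $c$ for the corresponding diagonal entry of $C$ (and ignoring $ε$), the cancellation looks like this:

          $$ \tilde g_t=c\,g_t \;\Rightarrow\; \tilde m_t=c\,m_t,\;\; \tilde v_t=c^2 v_t \;\Rightarrow\; \frac{\tilde m_t}{\sqrt{\tilde v_t}}=\frac{c\,m_t}{|c|\sqrt{v_t}}=\frac{m_t}{\sqrt{v_t}} \text{ (for } c>0\text{)} $$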
      • Another benefit is that the magnitude of each step is (approximately) bounded by the learning rate → the update is the learning rate times the ratio of the corrected first moment to the root of the corrected second moment, and that ratio is roughly at most 1 in magnitude (sketched below)
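        • Roughly, since $|E[g]| \le \sqrt{E[g^2]}$:

          $$ |Δ^{(t)}_j| = a\,\frac{|m^{Δ(t)}_j|}{\sqrt{v^{Δ(t)}_j}+ε} \lesssim a $$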

      • Furthermore, the objective DOES NOT NEED TO BE STATIONARY (its statistics don’t have to stay identical, keep the same variance, or repeat over time)

        • The gradient statistics can drift almost arbitrarily over training and this optimizer will still work
    • Algorithm in detail
      • Steps:
        • Requirements → stepsize $a$, decay rates $B_1,B_2$ with $0 \le B_1,B_2 < 1$, $θ_0$ (initial params), and $f(θ)$ (stochastic cost function of the parameters), in addition to epsilon $ε$
        • $m_0=0,v_0=0,t=0$ (initialize moments and timesteps)
        • While not converged
          • $t=t+1$
          • $g_t=∇f_t(θ_{t-1})$ → take the gradient of the cost function at the previous set of parameters
          • $m_t=B_1*m_{t-1}+(1-B_1)*g_t$ → find the first moment (estimating first moment)
            • The weights $B_1$ and $(1-B_1)$ add up to one, so the result stays on the same scale as a single moment of the gradient → it’s a blend of the old estimate and the new gradient
            • (In these notes, $m^{Δ}_t$ denotes the bias-corrected version of this estimate, defined a few lines below → same convention for the second moment)
          • $v_t=B_2 v_{t-1}+(1-B_2)g_t^2$
            • Find the new second-moment estimate → the new term being averaged in is the element-wise square $g_t^2$
            • REMEMBER → $g$ IS A VECTOR
              • So, every coordinate keeps its own running estimate, updated with the same weighted-sum rule
              • The weights $(1-B_2)B_2^{\,t-k}$ over past timesteps act like a probability distribution (they sum to roughly one) → the weighted average of the squared gradients under those weights GIVES US AN ESTIMATE OF THE SECOND MOMENT! (see the unrolled form below)
              • For the first moment, this was a weighted mean of the gradients themselves → here it is a weighted mean of the SQUARED gradients
              • So → THIS ENTIRE LINE IS ESTIMATING A MOMENT
                • We’re taking the (decayed) estimate built from all previous gradients and blending it with the newest one
                • We’re basically estimating the current moment from an exponentially weighted average of PAST GRADIENTS, rather than storing them all
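              • Unrolling the recursion makes the “weighted sum over past gradients” explicit (same algebra as above, just expanded):

                $$ v_t=(1-B_2)\sum_{k=1}^{t} B_2^{\,t-k}\,g_k^2, \qquad \sum_{k=1}^{t}(1-B_2)B_2^{\,t-k}=1-B_2^{\,t} $$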
            • $m^{Δ}_t=m_t/(1-B_1^t)$ → update estimate while correcting for bias
            • Bias correction and why it’s important
              • Basically → normally, $B_2$ is closer to one than $B_1$, meaning that the second moment does not change as much as the first moment
              • So, in the first few steps, where the estimates are still dominated by their zero initialization, the uncorrected update would be huge.
                • It’s simple → $m_1=(1-B_1)g_1$ while $v_1=(1-B_2)g_1^2$; because $(1-B_2)$ is much smaller than $(1-B_1)$, $v_1$ is shrunk toward zero far more than $m_1$
                • As the formula takes the square root of $v_t$ AND THEN DIVIDES by it, the update would be extremely large for the first couple of steps while $v_t$ is still tiny
              • On the other hand, if we just used the raw gradients as the updates directly, this initialization bias wouldn’t matter → the problem only appears because we divide one biased estimate by another
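              • A quick worked check at $t=1$, using the commonly quoted defaults $B_1=0.9$, $B_2=0.999$ (defaults assumed here, not stated in these notes):

                $$ m_1=0.1\,g_1,\;\; v_1=0.001\,g_1^2 \;\;\Rightarrow\;\; m^{Δ}_1=\frac{0.1\,g_1}{1-0.9}=g_1,\;\; v^{Δ}_1=\frac{0.001\,g_1^2}{1-0.999}=g_1^2 $$

              • So after correction the very first step is roughly $-a\,g_1/(|g_1|+ε)$, i.e., about $\pm a$ per coordinate, instead of being blown up by the tiny uncorrected $v_1$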
            • Note that the optimizer uses the moving average to estimate the first and second moments
              • THE MOVING AVERAGE IS THE EXACT SAME EQUATION AS $m_t$ and $v_t$!
              • We’re incorporating the weighted average of the new and old points together
              • Why can this be used to estimate a moment?
                • Simple → it’s a WEIGHTED SUM!
                • This mirrors the definition of a moment as a weighted expectation → recall that the first term already folds in the (decayed) weighted sum of all the previous gradients
                • Therefore, since we aren’t keeping the exact values of the previous gradients around, this becomes an estimate of what the moments are at that particular time
            • NOTE → THE BIAS TERMS ARE ACTIVELY DECAYING
              • The terms $B_1^t$ and $B_2^t$ get smaller and smaller as they are raised to the power $t$
              • As a result, over a long period of time, the correction denominators $(1-B^t)$ approach 1 → so the bias correction matters a lot early on and becomes negligible later
          • $v^{Δ}_t=v_t/(1-B_2^t)$ → same process but for the second moment
          • $θ_t=θ_{t-1} - \frac{a}{\sqrt{v^{Δ}_t}+ε}\, m^{Δ}_t$ → update parameters (element-wise)
            • We take the square root of the second moment because otherwise we’d be dividing by the (average of the) SQUARED GRADIENTS → the scale wouldn’t match the first moment in the numerator, distorting the updates
            • Instead, the square root of the bias-corrected second-moment estimate gives a more realistic scale to divide by (almost a standard deviation!)
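      • A minimal NumPy sketch of the loop above, assuming a user-supplied `grad_fn(theta)` that returns the stochastic gradient $∇f_t(θ)$; the function/argument names and the defaults ($a=0.001$, $B_1=0.9$, $B_2=0.999$, $ε=10^{-8}$) are just the commonly used ones for illustration, not values fixed by these notes:

        ```python
        import numpy as np

        def adam(grad_fn, theta0, alpha=0.001, beta1=0.9, beta2=0.999,
                 eps=1e-8, num_steps=1000):
            """Sketch of the Adam loop described above (not a reference implementation)."""
            theta = np.asarray(theta0, dtype=float).copy()
            m = np.zeros_like(theta)  # first-moment estimate, m_0 = 0
            v = np.zeros_like(theta)  # second raw-moment estimate, v_0 = 0

            for t in range(1, num_steps + 1):
                g = grad_fn(theta)                   # g_t = grad of f_t at theta_{t-1}
                m = beta1 * m + (1 - beta1) * g      # EMA of gradients
                v = beta2 * v + (1 - beta2) * g * g  # EMA of squared gradients
                m_hat = m / (1 - beta1 ** t)         # bias-corrected first moment
                v_hat = v / (1 - beta2 ** t)         # bias-corrected second moment
                theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)  # parameter update
            return theta

        # Tiny usage check: minimize f(x) = sum(x^2), whose gradient is 2x.
        x_min = adam(lambda x: 2 * x, theta0=np.array([3.0, -4.0]),
                     alpha=0.01, num_steps=5000)
        print(x_min)  # should land near [0, 0], to within roughly the step size
        ```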
  • Information about the algorithm
  • Bias Initializations
  • Analyzing Convergence
  • Key Terms:
  • PRACTICAL IMPLEMENTATION NOTES