Understanding KD (Knowledge Distillation)
Both of the above methods use the end predictions of layers and networks rather than the relationships captured between the layers themselves
The above two methods are what’s known as individual KD → they deal with losses between individual points in the network (i.e. individual layers and outputs)
Relational KD, on the other hand, aims to transfer structural knowledge - what parts of the network are directly related to one another
Measures the Euclidean distance between pairs of examples' embeddings → the relational "distance" structure of the representation space
Normalizes each pairwise distance by the AVERAGE pairwise distance over the mini-batch
The loss function used for this is the Huber loss:
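For reference, the standard Huber form (with $\delta = 1$ this is the smooth-L1 loss typically used for this objective), applied element-wise between the teacher's and student's potentials:

$$
l_\delta(x, y) =
\begin{cases}
\frac{1}{2}(x - y)^2 & \text{if } |x - y| \le \delta \\
\delta\,|x - y| - \frac{1}{2}\delta^2 & \text{otherwise}
\end{cases}
$$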
All in all, this allows the student to mimic the same relational “distances” as opposed to having the same output
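A minimal PyTorch sketch of this distance-wise term (the function and variable names here are my own, not from the paper):

```python
import torch
import torch.nn.functional as F

def normalized_pairwise_distances(embeddings):
    # Euclidean distance between every pair of examples in the batch
    dist = torch.cdist(embeddings, embeddings, p=2)
    # Normalize by the mean of the non-zero (off-diagonal) distances
    mean_dist = dist[dist > 0].mean()
    return dist / (mean_dist + 1e-8)

def rkd_distance_loss(student_emb, teacher_emb):
    # The student mimics the teacher's normalized pairwise-distance structure,
    # penalized with the Huber / smooth-L1 loss
    with torch.no_grad():
        teacher_d = normalized_pairwise_distances(teacher_emb)
    student_d = normalized_pairwise_distances(student_emb)
    return F.smooth_l1_loss(student_d, teacher_d)
```

Gradients flow only through the student's distances; the teacher's distances are treated as fixed targets.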
Instead of measuring the Euclidean distance between the two points, we measure the angle formed by three given points
This provides us with angle-wise potentials rather than raw distances → since angles are higher-order properties, it is hypothesized that they can transfer more important relational information (see the sketch after the diagram below)
The objective function used to penalize differences is still the Huber loss
Diagram:
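A sketch of the angle-wise potential under the same assumptions (for each triplet of examples, the cosine of the angle at the middle point; names are again my own):

```python
import torch
import torch.nn.functional as F

def angle_potentials(embeddings):
    # For each triplet (i, j, k): cosine of the angle at j,
    # i.e. <(t_i - t_j)/||t_i - t_j||, (t_k - t_j)/||t_k - t_j||>
    diffs = embeddings.unsqueeze(0) - embeddings.unsqueeze(1)   # [N, N, D]
    unit_diffs = F.normalize(diffs, p=2, dim=2)
    # Batched inner products -> [N, N, N] tensor of angle cosines
    return torch.bmm(unit_diffs, unit_diffs.transpose(1, 2))

def rkd_angle_loss(student_emb, teacher_emb):
    with torch.no_grad():
        teacher_a = angle_potentials(teacher_emb)
    student_a = angle_potentials(student_emb)
    return F.smooth_l1_loss(student_a, teacher_a)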
BANs and a new perspective on KD
Slight modification of response-based IKD
Allows the teacher to bias the student's gradients toward a certain distribution → in a sense, we repeat the teacher's convergence process from a different starting point, which can allow for better convergence (the teacher "encourages" the student to adapt its gradients in a certain manner, potentially leading it to a better minimum over time)
The gradient induced in the student model has two key components: a "dark knowledge" term coming from the teacher's probabilities on the incorrect classes, and a ground-truth term that is effectively re-weighted by the teacher's confidence in the correct class
In other words:
If we simply use the predictions given by the teacher model, then this will be of almost no benefit when compared to just training with the labels themselves
So, we introduce a temperature in the softmax activation of the final layer to better capture some of this dark knowledge and use the DK term to compute gradients
As a result, we can find the STUDENT LOSS via HARD PREDICTIONS, and the DISTILLATION LOSS via SOFT PREDICTIONS (where the logits have been scaled by the temperature); a code sketch follows the diagram below
Diagram of this process:
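A minimal sketch of this combined objective (the temperature T and the weighting alpha are illustrative choices, not fixed by the method):

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Student loss: ordinary cross-entropy against the hard labels
    hard_loss = F.cross_entropy(student_logits, labels)
    # Distillation loss: KL divergence between temperature-softened distributions
    soft_student = F.log_softmax(student_logits / T, dim=1)
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    soft_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    # Weighted combination of the two terms
    return alpha * hard_loss + (1 - alpha) * soft_loss
```

The T² factor compensates for the way temperature scaling shrinks the gradients of the soft term.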
But, we can improve on this process without the need for dark knowledge softening
Another method of doing KD besides dark knowledge involves matching attention maps → a norm taken across the channel dimension at each spatial location
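One common way to realize such an attention map is to sum squared activations over the channel dimension at each spatial location and normalize the flattened result; a sketch under that assumption:

```python
import torch
import torch.nn.functional as F

def attention_map(feature_map):
    # feature_map: [batch, channels, height, width]
    # Collapse the channel dimension by summing squared activations at each
    # spatial location, then L2-normalize the flattened map
    attn = feature_map.pow(2).sum(dim=1).flatten(start_dim=1)   # [batch, H*W]
    return F.normalize(attn, p=2, dim=1)

def attention_transfer_loss(student_fm, teacher_fm):
    # Match the student's attention map to the teacher's
    return (attention_map(student_fm) - attention_map(teacher_fm)).pow(2).mean()
```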
Regarding self-distillation, a potential method for finding the distillation loss involves calculating inner products between activation tensors across different residual layers
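A sketch of one way such an inner-product term could look: a Gram-style matrix between two activation tensors with matching spatial size, compared between student and teacher (the exact layers paired and the normalization are assumptions here):

```python
import torch
import torch.nn.functional as F

def layer_inner_product(feat_a, feat_b):
    # feat_a: [batch, C1, H, W], feat_b: [batch, C2, H, W] with matching H, W
    batch, c1, h, w = feat_a.shape
    c2 = feat_b.shape[1]
    a = feat_a.reshape(batch, c1, h * w)
    b = feat_b.reshape(batch, c2, h * w)
    # Channel-by-channel inner products, averaged over spatial positions -> [batch, C1, C2]
    return torch.bmm(a, b.transpose(1, 2)) / (h * w)

def inner_product_distillation_loss(student_feats, teacher_feats):
    # student_feats / teacher_feats: a pair of activation tensors taken from
    # two different residual layers of the same network
    s = layer_inner_product(*student_feats)
    t = layer_inner_product(*teacher_feats)
    return F.mse_loss(s, t)
```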
How BANs work
Proposed Method + Mathematics
The Experimentation Process
Key Terms: