Understanding KD (Knowledge Distillation)
Both of the above methods use the end predictions of layers and networks rather than the relationships captured between the layers themselves
The above two methods are what’s known as individual KD → they deal with losses between individual points in the network (i.e. individual layers and outputs)
Relational KD, on the other hand, aims to transfer structural knowledge - what parts of the network are directly related to one another
Measures the Euclidean distance between pairs of examples' embeddings → the relational "distance" structure of the representation space
Normalizes each pairwise distance by the AVERAGE pairwise distance over the mini-batch
The loss function used for this is the Huber loss:
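For reference, the standard Huber form (with $\delta = 1$ this is the smooth-L1 loss typically used for this objective), applied element-wise between the teacher's and student's potentials:

$$
l_\delta(x, y) =
\begin{cases}
\frac{1}{2}(x - y)^2 & \text{if } |x - y| \le \delta \\
\delta\,|x - y| - \frac{1}{2}\delta^2 & \text{otherwise}
\end{cases}
$$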
All in all, this allows the student to mimic the same relational “distances” as opposed to having the same output
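A minimal PyTorch sketch of this distance-wise term (the function and variable names here are my own, not from the paper):

```python
import torch
import torch.nn.functional as F

def normalized_pairwise_distances(embeddings):
    # Euclidean distance between every pair of examples in the batch
    dist = torch.cdist(embeddings, embeddings, p=2)
    # Normalize by the mean of the non-zero (off-diagonal) distances
    mean_dist = dist[dist > 0].mean()
    return dist / (mean_dist + 1e-8)

def rkd_distance_loss(student_emb, teacher_emb):
    # The student mimics the teacher's normalized pairwise-distance structure,
    # penalized with the Huber / smooth-L1 loss
    with torch.no_grad():
        teacher_d = normalized_pairwise_distances(teacher_emb)
    student_d = normalized_pairwise_distances(student_emb)
    return F.smooth_l1_loss(student_d, teacher_d)
```

Gradients flow only through the student's distances; the teacher's distances are treated as fixed targets.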
Instead of measuring the Euclidean distance between the two points, we measure the angle formed by three given points
This provides us with angle-wise potentials rather than raw distances → since angles are higher-order properties, it is hypothesized that they can transfer more important relational information (see the sketch after the diagram below)
The objective function used to penalize differences is still the Huber loss
Diagram:
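A sketch of the angle-wise potential under the same assumptions (for each triplet of examples, the cosine of the angle at the middle point; names are again my own):

```python
import torch
import torch.nn.functional as F

def angle_potentials(embeddings):
    # For each triplet (i, j, k): cosine of the angle at j,
    # i.e. <(t_i - t_j)/||t_i - t_j||, (t_k - t_j)/||t_k - t_j||>
    diffs = embeddings.unsqueeze(0) - embeddings.unsqueeze(1)   # [N, N, D]
    unit_diffs = F.normalize(diffs, p=2, dim=2)
    # Batched inner products -> [N, N, N] tensor of angle cosines
    return torch.bmm(unit_diffs, unit_diffs.transpose(1, 2))

def rkd_angle_loss(student_emb, teacher_emb):
    with torch.no_grad():
        teacher_a = angle_potentials(teacher_emb)
    student_a = angle_potentials(student_emb)
    return F.smooth_l1_loss(student_a, teacher_a)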
BANs and a new perspective on KD
Slight modification of response-based IKD
Allows the teacher to bias the student's gradients toward a certain distribution → in a sense, we repeat the teacher's convergence process from a different starting point, which can allow for better convergence (the teacher "encourages" the student to adapt its gradients in a certain manner, potentially leading it to a better minimum over time)
The gradient induced in the student model has two key components: a "dark knowledge" term coming from the teacher's probabilities on the incorrect classes, and a ground-truth term that is effectively re-weighted by the teacher's confidence in the correct class
In other words:
If we simply use the predictions given by the teacher model, then this will be of almost no benefit when compared to just training with the labels themselves
So, we introduce a temperature in the softmax activation of the final layer to better capture some of this dark knowledge and use the DK term to compute gradients
As a result, we can find the STUDENT LOSS via HARD PREDICTIONS, and the DISTILLATION LOSS via SOFT PREDICTIONS (where the logits have been scaled by the temperature); a code sketch follows the diagram below
Diagram of this process:
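A minimal sketch of this combined objective (the temperature T and the weighting alpha are illustrative choices, not fixed by the method):

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Student loss: ordinary cross-entropy against the hard labels
    hard_loss = F.cross_entropy(student_logits, labels)
    # Distillation loss: KL divergence between temperature-softened distributions
    soft_student = F.log_softmax(student_logits / T, dim=1)
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    soft_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    # Weighted combination of the two terms
    return alpha * hard_loss + (1 - alpha) * soft_loss
```

The T² factor compensates for the way temperature scaling shrinks the gradients of the soft term.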
But, we can improve on this process without the need for dark knowledge softening
Another method of doing KD besides dark knowledge involves matching attention maps → a norm taken across the channel dimension at each spatial location
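One common way to realize such an attention map is to sum squared activations over the channel dimension at each spatial location and normalize the flattened result; a sketch under that assumption:

```python
import torch
import torch.nn.functional as F

def attention_map(feature_map):
    # feature_map: [batch, channels, height, width]
    # Collapse the channel dimension by summing squared activations at each
    # spatial location, then L2-normalize the flattened map
    attn = feature_map.pow(2).sum(dim=1).flatten(start_dim=1)   # [batch, H*W]
    return F.normalize(attn, p=2, dim=1)

def attention_transfer_loss(student_fm, teacher_fm):
    # Match the student's attention map to the teacher's
    return (attention_map(student_fm) - attention_map(teacher_fm)).pow(2).mean()
```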
Regarding self-distillation, a potential method for finding the distillation loss involves calculating inner products between activation tensors across different residual layers
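A sketch of one way such an inner-product term could look: a Gram-style matrix between two activation tensors with matching spatial size, compared between student and teacher (the exact layers paired and the normalization are assumptions here):

```python
import torch
import torch.nn.functional as F

def layer_inner_product(feat_a, feat_b):
    # feat_a: [batch, C1, H, W], feat_b: [batch, C2, H, W] with matching H, W
    batch, c1, h, w = feat_a.shape
    c2 = feat_b.shape[1]
    a = feat_a.reshape(batch, c1, h * w)
    b = feat_b.reshape(batch, c2, h * w)
    # Channel-by-channel inner products, averaged over spatial positions -> [batch, C1, C2]
    return torch.bmm(a, b.transpose(1, 2)) / (h * w)

def inner_product_distillation_loss(student_feats, teacher_feats):
    # student_feats / teacher_feats: a pair of activation tensors taken from
    # two different residual layers of the same network
    s = layer_inner_product(*student_feats)
    t = layer_inner_product(*teacher_feats)
    return F.mse_loss(s, t)
```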
How BANs work
Proposed Method + Mathematics
The Experimentation Process
Key Terms: