Could a researcher chime in: is it possible that smaller models have a higher potential for mapping activations to abstractions present in our own language understanding? And if so, could we use these mappings to make our models more transparent and explainable?
Distillation would not meaningfully help make a model more transparent or explainable.
I imagine you're thinking about bottlenecking on a really tight latent space. That would require a different architecture, which would be harder to train and would probably suffer in accuracy.
Often it's better to re-frame the training as a multi-targeted problem with an explanation component.
I'm curious about the "explanation component" approach you've hinted at... are there any publications you can point to that use it? If not, could you maybe describe it grosso modo?
It's just a function of multi-targeting. When the problem can be framed so that one of the outputs helps explain the other, I would call that output an "explanation component".
For one of my problems I know that the model would need to 'see' an orientation field in order to make its predictions, so instead of just asking for the predictions I also ask for an orientation field. That auxiliary loss gets only a small weight and is computed against a very rough ground truth, together with a total variation loss. This ends up producing a nicer orientation field than the ground truth.
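A minimal NumPy sketch of that setup. The MSE losses, the `aux_weight`/`tv_weight` values, and the function names are all illustrative assumptions, not the commenter's actual code:

```python
import numpy as np

def total_variation(field):
    """Total variation of a 2D field: sum of absolute differences
    between vertically and horizontally adjacent values.
    Penalizing this encourages a smooth orientation field."""
    dv = np.abs(np.diff(field, axis=0)).sum()
    dh = np.abs(np.diff(field, axis=1)).sum()
    return dv + dh

def combined_loss(pred_main, target_main,
                  pred_field, rough_field,
                  aux_weight=0.1, tv_weight=0.01):
    """Main prediction loss plus a lightly weighted auxiliary loss:
    the predicted field is supervised against a rough ground truth
    and regularized with total variation."""
    main = np.mean((pred_main - target_main) ** 2)   # primary objective
    aux = np.mean((pred_field - rough_field) ** 2)   # rough supervision
    tv = total_variation(pred_field) / pred_field.size
    return main + aux_weight * aux + tv_weight * tv
```

Because the auxiliary weight is small, the field head can deviate from the rough labels where the TV term (and the main task) pull it toward something smoother, which is how the output can end up nicer than the ground truth.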
Say you want a model that takes in a picture of a person and tells you what emotion they are feeling - sad/happy/disgusted/whatever.
But you also want to know why it classifies a person as sad/happy/etc.
Then you make the model have 2 outputs. The first is just a classification of the emotion as you'd have normally.
The second is a map of which parts of the image contributed most to the classification, e.g. a heatmap over a smiling mouth in one case, or over squinted eyes in another.
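The two-output idea above can be sketched as a shared backbone feeding two heads. Everything here is a stand-in (a single linear layer instead of a CNN, random weights, invented names) just to show the structure:

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_features(image, w_shared):
    """Stand-in for a shared backbone: one linear layer + ReLU.
    A real model would use a CNN; this is purely illustrative."""
    return np.maximum(image.flatten() @ w_shared, 0.0)

def two_head_model(image, w_shared, w_cls, w_map, h, w):
    """One backbone, two outputs: a distribution over emotion
    classes, and a spatial heatmap over the input image."""
    feats = shared_features(image, w_shared)
    logits = feats @ w_cls
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                     # softmax over emotions
    heatmap = (feats @ w_map).reshape(h, w)  # "where the evidence is"
    return probs, heatmap
```

Both heads are trained jointly, so the heatmap head is optimized to be consistent with the same features the classifier uses.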
I could take a U-Net and plug a classifier into it, hmm. I tried something like that before, but it didn't work that well. Maybe what I did sucked and there's a better way.
Not sure how that would work with a segmentation architecture like U-Net, though. It makes more sense for showing where in an image the network activated to assign a label. AFAIK U-Net gives every pixel a label, so I don't see how you could do what the parent describes.
That is activation mapping. My understanding is that he was suggesting another objective, optimized to explain why the network decided what it did. But you need a loss function for that.
For, let's say, identifying a car, we could build a labeled dataset that has not only classes but also labels for the parts of the car that we think differentiate a car from a motorcycle or whatever. Then the model has to output both the class and segmentation masks or bounding boxes for those parts.
Then we combine the losses from both outputs and train.
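One way the combined loss could look, assuming softmax class probabilities and a per-pixel part mask. The specific losses (cross-entropy plus binary cross-entropy) and the `seg_weight` value are assumptions for the sketch:

```python
import numpy as np

def cross_entropy(class_probs, label):
    """Classification loss for the class head (car vs motorcycle vs ...)."""
    return -np.log(class_probs[label] + 1e-12)

def pixelwise_bce(pred_mask, true_mask):
    """Per-pixel binary cross-entropy for the part-segmentation head."""
    p = np.clip(pred_mask, 1e-12, 1 - 1e-12)
    return -np.mean(true_mask * np.log(p) + (1 - true_mask) * np.log(1 - p))

def joint_loss(class_probs, label, pred_mask, part_mask, seg_weight=0.5):
    """Weighted sum of both losses; backprop through this trains
    the class head and the parts head jointly."""
    return cross_entropy(class_probs, label) \
        + seg_weight * pixelwise_bce(pred_mask, part_mask)
```

The `seg_weight` knob plays the same role as the small auxiliary weight mentioned earlier in the thread: the part masks guide the features without dominating the classification objective.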
I tried attaching layers to a U-Net in different ways to interpret the segmentation, but I couldn't get it to work well.
Afaik the more you tighten a bottleneck the more accuracy you lose, and much faster than you gain interpretability. My guess is that such abstractions would require very powerful "priors" (as in, knowledge stored in the network as opposed to being stored in the representation) that humans gained with evolution and that today's models don't possess.
That is definitely the next thing I would try :) mostly the reason why I started with a BiLSTM is that it's much easier to implement/debug, also afaik the time complexity of RNNs with respect to the sequence length is O(N) but it's O(N^2) for attentional models like a Transformer. Although it probably doesn't matter much on the scale of the SST-2 dataset.