Could a researcher chime in: is it possible that smaller models have a higher potential for mapping activations to abstractions present in our own language understanding? And if so, could we use these mappings to make our models more transparent and explainable?
Distillation would not meaningfully help make a model more transparent or explainable.
I imagine you're thinking about bottlenecking on a really tight latent space. That would require a different architecture, which would be harder to train and would probably suffer in accuracy.
Often it's better to re-frame the training as a multi-targeted problem with an explanation component.
I'm curious about the "explanation component" approach you've hinted at... are there any publications you can point to that use it? If not, could you maybe describe it grosso modo?
It's just a function of multi-targeting. When the problem can be framed so that one of the outputs helps explain the other, I would call that output an "explanation component".
For one of my problems I know that the model would need to 'see' an orientation field in order to make its predictions, so instead of just asking for the predictions I also ask for an orientation field. That auxiliary loss gets only a small weight and is computed against a very rough ground truth, together with a total variation loss. This ends up producing a nicer orientation field than the ground truth.
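A minimal NumPy sketch of that setup. The MSE losses, the `aux_weight`/`tv_weight` values, and the function names are all illustrative assumptions, not the commenter's actual code:

```python
import numpy as np

def total_variation(field):
    """Total variation of a 2D field: sum of absolute differences
    between vertically and horizontally adjacent values.
    Penalizing this encourages a smooth orientation field."""
    dv = np.abs(np.diff(field, axis=0)).sum()
    dh = np.abs(np.diff(field, axis=1)).sum()
    return dv + dh

def combined_loss(pred_main, target_main,
                  pred_field, rough_field,
                  aux_weight=0.1, tv_weight=0.01):
    """Main prediction loss plus a lightly weighted auxiliary loss:
    the predicted field is supervised against a rough ground truth
    and regularized with total variation."""
    main = np.mean((pred_main - target_main) ** 2)   # primary objective
    aux = np.mean((pred_field - rough_field) ** 2)   # rough supervision
    tv = total_variation(pred_field) / pred_field.size
    return main + aux_weight * aux + tv_weight * tv
```

Because the auxiliary weight is small, the field head can deviate from the rough labels where the TV term (and the main task) pull it toward something smoother, which is how the output can end up nicer than the ground truth.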
Say you want a model that takes in a picture of a person and tells you what emotion they are feeling - sad/happy/disgusted/whatever.
But you also want to know why it classifies a person as sad/happy/etc.
Then you make the model have 2 outputs. The first is just a classification of the emotion as you'd have normally.
The second is a map of which parts of the image contributed most to the classification, e.g. a heatmap over a smiling mouth in one case, or over squinted eyes in another.
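The two-output idea above can be sketched as a shared backbone feeding two heads. Everything here is a stand-in (a single linear layer instead of a CNN, random weights, invented names) just to show the structure:

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_features(image, w_shared):
    """Stand-in for a shared backbone: one linear layer + ReLU.
    A real model would use a CNN; this is purely illustrative."""
    return np.maximum(image.flatten() @ w_shared, 0.0)

def two_head_model(image, w_shared, w_cls, w_map, h, w):
    """One backbone, two outputs: a distribution over emotion
    classes, and a spatial heatmap over the input image."""
    feats = shared_features(image, w_shared)
    logits = feats @ w_cls
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                     # softmax over emotions
    heatmap = (feats @ w_map).reshape(h, w)  # "where the evidence is"
    return probs, heatmap
```

Both heads are trained jointly, so the heatmap head is optimized to be consistent with the same features the classifier uses.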
I could take a U-Net and plug a classifier into it, hmm. I tried something like that before, but it didn't work that well. Maybe what I did sucked and there's a better way.
Not sure how that would work with a segmentation architecture like U-Net, though. It makes more sense for showing where in an image the network activated to assign a label. AFAIK U-Net gives every pixel a label, so I don't see how you could do what the parent describes.
That is activation mapping. My understanding is that he was suggesting another objective, optimized to explain why the network decided what it did. But you need a loss function for that.
For, let's say, identifying a car, we could build a labeled dataset that has not only classes but also labels for the parts of the car that we think differentiate a car from a motorcycle or whatever. Then the model has to output both the class and segmentation masks or bounding boxes for those parts.
Then we combine the losses from both outputs and train.
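One way the combined loss could look, assuming softmax class probabilities and a per-pixel part mask. The specific losses (cross-entropy plus binary cross-entropy) and the `seg_weight` value are assumptions for the sketch:

```python
import numpy as np

def cross_entropy(class_probs, label):
    """Classification loss for the class head (car vs motorcycle vs ...)."""
    return -np.log(class_probs[label] + 1e-12)

def pixelwise_bce(pred_mask, true_mask):
    """Per-pixel binary cross-entropy for the part-segmentation head."""
    p = np.clip(pred_mask, 1e-12, 1 - 1e-12)
    return -np.mean(true_mask * np.log(p) + (1 - true_mask) * np.log(1 - p))

def joint_loss(class_probs, label, pred_mask, part_mask, seg_weight=0.5):
    """Weighted sum of both losses; backprop through this trains
    the class head and the parts head jointly."""
    return cross_entropy(class_probs, label) \
        + seg_weight * pixelwise_bce(pred_mask, part_mask)
```

The `seg_weight` knob plays the same role as the small auxiliary weight mentioned earlier in the thread: the part masks guide the features without dominating the classification objective.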
I tried attaching layers to a U-Net in different ways to interpret the segmentation, but I couldn't get it to work well.
Afaik the more you tighten a bottleneck the more accuracy you lose, and much faster than you gain interpretability. My guess is that such abstractions would require very powerful "priors" (as in, knowledge stored in the network as opposed to being stored in the representation) that humans gained with evolution and that today's models don't possess.
That is definitely the next thing I would try :) mostly the reason why I started with a BiLSTM is that it's much easier to implement/debug, also afaik the time complexity of RNNs with respect to the sequence length is O(N) but it's O(N^2) for attentional models like a Transformer. Although it probably doesn't matter much on the scale of the SST-2 dataset.