MKD-Ultra: Compressing Causal Language Models in Multiple Steps
Student Team: Mamon Alsalihy, Austin King, Nilay Patel
Project Mentor: Sarangarajan “Partha” Parthasarathy, Microsoft
Modern deep neural networks offer immense predictive power at the cost of equally great size and compute requirements. Much recent work has focused on compressing these large models into smaller versions with comparable predictive capability. Transformer language models such as BERT and XLNet are particular targets for compression because of their strong performance across a wide range of tasks. This project presents two compression techniques for training a small, autoregressive language model: a bidirectional teacher model is used to increase the information available to the student, and a multi-step distillation approach is adopted to lessen the capacity gap between the large and small models.
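To make the distillation setup concrete, the following is a minimal sketch of a standard teacher-student distillation loss in a PyTorch-style framework. The function name, `temperature`, and `alpha` weighting are illustrative assumptions, not details taken from the report, which combines this kind of soft-target objective with additional compression steps.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Generic knowledge-distillation objective (illustrative sketch)."""
    # Soft-target term: KL divergence between temperature-scaled
    # teacher and student distributions (scaled by T^2, as in Hinton et al.).
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd_loss = F.kl_div(soft_student, soft_teacher,
                       reduction="batchmean") * (temperature ** 2)
    # Hard-target term: ordinary cross-entropy against the ground-truth tokens.
    ce_loss = F.cross_entropy(student_logits, labels)
    return alpha * kd_loss + (1.0 - alpha) * ce_loss
```

In a multi-step scheme, a loss of this form would be applied repeatedly, with each intermediate student serving as the teacher for the next, smaller model.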