MKD-Ultra: Compressing Causal Language Models in Multiple Steps
Student Team: Mamon Alsalihy, Austin King, Nilay Patel
Project Mentor: Sarangarajan “Partha” Parthasarathy, Microsoft
Modern deep neural networks offer immense predictive power at the cost of equally great size and compute requirements. Much recent work has focused on compressing these large models into smaller versions with comparable predictive capability. Transformer language models such as BERT and XLNet are particular targets for compression because of their strong performance across a wide range of tasks. This project presents two compression techniques for training a small, autoregressive language model: a bidirectional teacher model is used to increase the information available to the student, and a multi-step distillation approach is adopted to lessen the capacity gap between the large and small models.
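To make the distillation setup concrete, the following is a minimal sketch of a standard teacher-student distillation loss in a PyTorch-style framework. The function name, `temperature`, and `alpha` weighting are illustrative assumptions, not details taken from the report, which combines this kind of soft-target objective with additional compression steps.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Generic knowledge-distillation objective (illustrative sketch)."""
    # Soft-target term: KL divergence between temperature-scaled
    # teacher and student distributions (scaled by T^2, as in Hinton et al.).
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd_loss = F.kl_div(soft_student, soft_teacher,
                       reduction="batchmean") * (temperature ** 2)
    # Hard-target term: ordinary cross-entropy against the ground-truth tokens.
    ce_loss = F.cross_entropy(student_logits, labels)
    return alpha * kd_loss + (1.0 - alpha) * ce_loss
```

In a multi-step scheme, a loss of this form would be applied repeatedly, with each intermediate student serving as the teacher for the next, smaller model.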