Knowledge distillation, also called model distillation, transfers knowledge from a large “teacher” AI model to a smaller “student” model. The goal is a student that retains most of the teacher’s capability while being faster and cheaper to run, for example on devices like phones.
Core Concept
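The teacher processes training data and, for each input, produces a probability distribution over the possible outputs, known as “soft targets.” The student is trained to match these distributions, often alongside the true labels. Soft targets capture nuances that hard labels alone miss, such as which incorrect answers the teacher considers nearly plausible.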
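The soft-target matching described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the `temperature` and `alpha` parameters, the function names, and the example logits are all assumptions introduced here. The sketch follows one common formulation, in which a KL-divergence term on temperature-softened distributions is blended with ordinary cross-entropy on the hard label.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-target term with a hard-label term (hypothetical helper)."""
    # Soft targets: teacher and student distributions at the same temperature.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student): how far the student is from the soft targets.
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Standard cross-entropy against the true class, at temperature 1.
    ce = -math.log(softmax(student_logits)[hard_label])
    # The temperature**2 factor is a common convention for keeping the
    # soft term's gradient scale comparable across temperatures.
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce

# Illustrative logits for a 3-class problem (made-up values).
teacher = [1.0, 2.0, 3.0]
student = [0.5, 2.5, 2.0]
print(distillation_loss(student, teacher, hard_label=2))
```

With `alpha=1.0` the student is trained on soft targets only; lowering `alpha` shifts weight toward the hard labels.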
Key Methods
Response-based: Student mimics the teacher’s final outputs, either its raw logits or the probabilities computed from them.
Feature-based: Student learns to reproduce the teacher’s intermediate activations (feature representations), not just its final outputs.
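Self-distillation: A model acts as its own teacher, for example when its deeper layers supervise its shallower ones or when a trained copy of the same architecture teaches a fresh one.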
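To make the feature-based method in the list above more concrete, here is a small sketch under stated assumptions. A student is usually narrower than its teacher, so the student's activations are first mapped to the teacher's width before the two are compared. The `project` and `feature_loss` helpers, the fixed projection weights, and the example vectors are hypothetical; in practice the projection would be learned jointly with the student.

```python
def project(features, weights):
    """Map student features to the teacher's width with a bias-free linear layer."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

def feature_loss(student_features, teacher_features, weights):
    """Mean squared error between projected student and teacher activations."""
    projected = project(student_features, weights)
    squared_errors = [(p - t) ** 2 for p, t in zip(projected, teacher_features)]
    return sum(squared_errors) / len(teacher_features)

# Made-up activations: a 2-unit student layer and a 3-unit teacher layer.
student_features = [1.0, 2.0]
teacher_features = [1.0, 2.0, 5.0]
# 3x2 projection matrix (fixed here for illustration only).
weights = [[1.0, 0.0],
           [0.0, 1.0],
           [1.0, 1.0]]
print(feature_loss(student_features, teacher_features, weights))
```

A feature term like this is typically added to a response-based loss rather than used on its own.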
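Distillation can shrink models substantially while preserving most of their performance. DistilBERT, for example, is reported to be about 40% smaller than BERT while retaining roughly 97% of its language-understanding performance.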