Knowledge distillation with insufficient training data for regression

Research output: Contribution to journal › Article › peer-review

8 Scopus citations

Abstract

Knowledge distillation has been widely used to compress a large teacher network into a smaller student network. Conventional approaches require access to the training dataset that was used to train the teacher network. However, in many real-world situations, the original training dataset is not fully reusable owing to practical constraints such as data security, privacy, and storage limits. In this study, we present a teacher–student matching method that improves knowledge distillation under data insufficiency for regression problems. Given an existing knowledge distillation method as the base, we introduce three additional learning objectives that make the student better emulate the prediction capability of the teacher: perturbation-based matching (PM), adversarial belief matching (ABM), and gradient matching (GM). PM matches the predictions of the teacher and student on synthetic data points created by perturbing original data points. ABM matches the predictions of the teacher and student on synthetic data points on which the two networks make different predictions. GM matches the gradients of the teacher and student on the original and synthetic data points. We demonstrate that the proposed method improves the prediction performance of the student network, particularly when only a small part of the original training dataset is available for use. When 10% of the original training dataset is used for knowledge distillation, the root mean squared error of the student network is reduced by 43.91% on average compared with existing knowledge distillation methods.
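
For a concrete picture of the three objectives, the following is a minimal PyTorch sketch of how such matching terms might be written for a regression student. It is based only on the abstract: the function names, perturbation scale, ascent schedule, and loss weights are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def pm_loss(teacher, student, x, sigma=0.1):
    """Perturbation-based matching (PM): match teacher and student
    predictions on synthetic points made by perturbing original inputs.
    The Gaussian noise scale `sigma` is an illustrative choice."""
    x_pert = x + sigma * torch.randn_like(x)
    with torch.no_grad():
        t_out = teacher(x_pert)
    return F.mse_loss(student(x_pert), t_out)

def abm_loss(teacher, student, x, steps=5, step_size=0.01):
    """Adversarial belief matching (ABM): ascend on the teacher-student
    prediction gap to find points where the two networks disagree,
    then match the student to the teacher at those points."""
    x_adv = x.clone().detach().requires_grad_(True)
    for _ in range(steps):
        gap = F.mse_loss(student(x_adv), teacher(x_adv))
        grad, = torch.autograd.grad(gap, x_adv)
        x_adv = (x_adv + step_size * grad.sign()).detach().requires_grad_(True)
    x_adv = x_adv.detach()
    with torch.no_grad():
        t_out = teacher(x_adv)
    return F.mse_loss(student(x_adv), t_out)

def gm_loss(teacher, student, x):
    """Gradient matching (GM): match the input gradients of the teacher
    and student predictions (assumes scalar regression outputs). The
    abstract applies this to both original and synthetic points; for
    brevity this sketch shows it for the given points only."""
    x = x.clone().detach().requires_grad_(True)
    t_grad, = torch.autograd.grad(teacher(x).sum(), x)
    s_grad, = torch.autograd.grad(student(x).sum(), x, create_graph=True)
    return F.mse_loss(s_grad, t_grad.detach())

def distillation_loss(teacher, student, x, weights=(1.0, 1.0, 1.0, 1.0)):
    """Base KD term plus the three auxiliary matching terms; the equal
    weighting here is a placeholder, not a tuned setting."""
    w_kd, w_pm, w_abm, w_gm = weights
    base = F.mse_loss(student(x), teacher(x).detach())
    return (w_kd * base
            + w_pm * pm_loss(teacher, student, x)
            + w_abm * abm_loss(teacher, student, x)
            + w_gm * gm_loss(teacher, student, x))
```

In a training loop, `distillation_loss` would be evaluated on the available subset of the original data (e.g., the 10% setting in the abstract) and minimized with respect to the student's parameters only.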

Original language: English
Article number: 108001
Journal: Engineering Applications of Artificial Intelligence
Volume: 132
DOIs
State: Published - 1 Jun 2024

Keywords

  • Data insufficiency
  • Knowledge distillation
  • Neural network
  • Regression
