TL;DR
We tackle black-box knowledge distillation (KD) of LLMs, where only teacher-generated texts are available (e.g., from proprietary APIs), so standard likelihood-based KD is infeasible. Our solution is GAD (Generative Adversarial Distillation), which enables on-policy learning of the student without probability supervision. GAD trains a discriminator (acting as an on-policy reward model) to distinguish student (generator) responses from teacher responses, forming a two-player minimax game. Results show that a Qwen2.5-14B-Instruct student trained with GAD achieves performance comparable to its teacher, GPT-5-Chat, on the LMSYS-Chat automatic evaluation. We open-source all code, data, and models.
Key Results
Figure 1: Comparison between GAD and sequence-level knowledge distillation (SeqKD) trained on the LMSYS-Chat dataset, evaluated by average GPT-4o scores. Left: results on the LMSYS-Chat test set. Right: average performance across the Dolly, SelfInst, and Vicuna datasets.
Generative Adversarial Distillation
Figure 2: Overview of the Generative Adversarial Distillation (GAD) framework.
The training objective is formulated as a two-player minimax game with the following value function $\mathcal{V}(G, D)$:
$$\max_G \min_D \ \ \mathcal{V}(G, D) = \mathbb{E}_{(x,y_t) \sim \mathcal{T}} \left[-\log \sigma\left( D(y_t) - D(G(x)) \right)\right]$$
We split the objective into two parts. The generator is trained with the following objective:
$$ \max_G \ \ \mathbb{E}_{(x,y_t) \sim \mathcal{T}}\left[D(G(x))\right]$$
We treat the discriminator score of the student response, $D(G(x))$, as a scalar reward and optimize the generator with policy gradient, using GRPO as the RL algorithm. The discriminator is trained to minimize the loss
$$ \min_D \ \ \mathbb{E}_{(x,y_t) \sim \mathcal{T}} \left[-\log \sigma\left( D(y_t) - D(G(x)) \right)\right]$$
which is the Bradley–Terry loss for learning the preference pair $y_t \succ G(x)$.
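For concreteness, here is a minimal PyTorch-style sketch of the two objectives, assuming a discriminator that maps a (prompt, response) pair to a scalar score. The function and tensor names (`discriminator_loss`, `generator_rewards`, `grpo_advantages`, `teacher_scores`, `student_scores`) are illustrative, not taken from the released code.

```python
# Minimal sketch of the GAD objectives, assuming the discriminator outputs a
# scalar score per (prompt, response) pair.
import torch
import torch.nn.functional as F


def discriminator_loss(teacher_scores: torch.Tensor,
                       student_scores: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss for the preference pair y_t > G(x):
    -log sigmoid(D(y_t) - D(G(x))), averaged over the batch."""
    return -F.logsigmoid(teacher_scores - student_scores).mean()


def generator_rewards(student_scores: torch.Tensor) -> torch.Tensor:
    """The generator treats D(G(x)) as a scalar, sequence-level reward;
    gradients do not flow back into the discriminator here."""
    return student_scores.detach()


def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages in the spirit of GRPO: rewards has shape
    (num_prompts, samples_per_prompt) and is normalized within each group."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True) + 1e-6
    return (rewards - mean) / std


# Toy usage with random scores standing in for discriminator outputs:
# 2 prompts, 4 on-policy student samples each.
teacher_scores = torch.randn(8, requires_grad=True)   # D(y_t)
student_scores = torch.randn(8, requires_grad=True)   # D(G(x))
loss_D = discriminator_loss(teacher_scores, student_scores)
advantages = grpo_advantages(generator_rewards(student_scores).view(2, 4))
```

In a full implementation, the scores would come from a scalar head on a language model, and the group-normalized advantages would weight the token-level log-probabilities of the sampled student responses in the policy-gradient update.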
We use a two-stage training procedure to warm up the generator and discriminator before GAD training: the generator is trained with cross-entropy loss on teacher responses, and the discriminator is trained with the Bradley–Terry loss at the same time.
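A similarly hedged sketch of the warm-up losses, under the same assumptions as above (tensor names and the padding convention are illustrative):

```python
# Sketch of the warm-up stage: the generator is fine-tuned on teacher responses
# while the discriminator is trained with the Bradley-Terry loss.
import torch
import torch.nn.functional as F


def warmup_generator_loss(student_logits: torch.Tensor,
                          teacher_token_ids: torch.Tensor,
                          ignore_index: int = -100) -> torch.Tensor:
    """Token-level cross-entropy (SFT) of the student on teacher responses.
    student_logits: (batch, seq_len, vocab); teacher_token_ids: (batch, seq_len)."""
    return F.cross_entropy(student_logits.transpose(1, 2),
                           teacher_token_ids,
                           ignore_index=ignore_index)

# The warm-up discriminator update reuses discriminator_loss(teacher_scores,
# student_scores) from the sketch above, with student_scores computed on
# responses sampled from the warm-up student.
```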
Figure 3: Off-policy discriminator suffers from reward hacking, whereas on-policy discriminator remains stable over thousands of training steps.
Experiments
We report the results of automatic evaluation using GPT-4o scores in Figure 1. We compare GAD with the instruct model before distillation and the SeqKD baseline. Across all datasets, GAD consistently outperforms the baselines. On the LMSYS-Chat test set, Qwen2.5-3B-Instruct trained with GAD matches the performance of Qwen2.5-7B-Instruct trained with SeqKD; similarly, Qwen2.5-7B-Instruct with GAD rivals Qwen2.5-14B-Instruct with SeqKD, and Qwen2.5-14B-Instruct with GAD is comparable to the GPT-5-Chat teacher.
In addition, GAD shows particularly strong gains in out-of-distribution generalization. On Dolly, SelfInst, and Vicuna, SeqKD yields marginal or even negative improvements, whereas GAD maintains robust performance gains. We attribute this to the superior generalization ability of reinforcement learning compared to supervised fine-tuning.
Figure 4: Human evaluation results on the LMSYS-Chat test set.
BibTeX
@article{ye2025blackboxonpolicydistillationlarge,
  title={Black-Box On-Policy Distillation of Large Language Models},
  author={Tianzhu Ye and Li Dong and Zewen Chi and Xun Wu and Shaohan Huang and Furu Wei},
  journal={arXiv preprint arXiv:2511.10643},
  year={2025},
  url={https://arxiv.org/abs/2511.10643}
}