Black-Box On-Policy Distillation of
Large Language Models

Tianzhu Ye*, Li Dong*, Zewen Chi, Xun Wu, Shaohan Huang, Furu Wei
Microsoft Research
*Indicates Equal Contribution

TL;DR

We tackle black-box knowledge distillation (KD) of LLMs, where only teacher-generated text is available (e.g., from proprietary APIs). Standard likelihood-based KD is infeasible in this setting. Our solution is GAD (Generative Adversarial Distillation), which enables on-policy learning of the student without probability-level supervision. GAD trains a discriminator (acting as an on-policy reward model) to distinguish the student (generator) from the teacher, forming a two-player minimax game. Results show that a Qwen2.5-14B-Instruct student trained with GAD achieves performance comparable to its teacher, GPT-5-Chat, on the LMSYS-Chat automatic evaluation. We open-source all code, data, and models.

Key Results


Figure 1: Comparison between GAD and sequence-level knowledge distillation (SeqKD) trained on the LMSYS-Chat dataset, evaluated by average GPT-4o scores. Left: results on the LMSYS-Chat test set. Right: average performance across the Dolly, SelfInst, and Vicuna datasets.

Generative Adversarial Distillation

We address the challenge of black-box distillation, where the student language model must learn from a proprietary teacher model that provides only textual responses $y_t$ for a given prompt $x$. Recent studies in white-box distillation highlight the importance of on-policy learning, where the student learns from its own generated responses. However, extending this idea to the black-box setting introduces a major challenge: when the student produces its own responses, there are no probability-level supervision signals available from the teacher to evaluate or correct them.
Method

Figure 2: Overview of Generative Adversarial Distillation (GAD) framework.

We propose Generative Adversarial Distillation (GAD). The training dataset $\mathcal{T}=\{(x, y_t)\}$ consists of input prompts $x$ paired with teacher responses $y_t$. The framework comprises a generator $G$, which is the student model, and a discriminator $D$. The generator produces a response $G(x)$ to a prompt $x$. The discriminator predicts a sequence-level scalar score $D([x,y])$ (denoted $D(y)$ below for brevity) for a prompt $x$ and response $y$.
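As a concrete illustration, below is a minimal sketch of one way to implement such a sequence-level discriminator: a language-model backbone with a scalar head read out at the last non-padding token. The backbone choice, head design, and padding convention here are illustrative assumptions, not necessarily the exact architecture in our released code.

```python
import torch
import torch.nn as nn
from transformers import AutoModel


class SequenceDiscriminator(nn.Module):
    """Scores a tokenized (prompt, response) pair with a single scalar D([x, y])."""

    def __init__(self, backbone_name: str = "Qwen/Qwen2.5-3B-Instruct"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        # Scalar head applied to the hidden state of the last non-padding token.
        self.score_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        last_idx = attention_mask.sum(dim=1) - 1          # assumes right padding
        last_hidden = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.score_head(last_hidden).squeeze(-1)   # shape: (batch,)
```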
The training objective is formulated as a two-player minimax game with the following value function $\mathcal{V}(G, D)$: $$\max_G \min_D \ \ \mathcal{V}(G, D) = \mathbb{E}_{(x,y_t) \sim \mathcal{T}} \left[-\log \sigma\left( D(y_t) - D(G(x)) \right)\right]$$ We split the objective into two parts:
The generator is trained with the following objective: $$ \max_G \ \ \mathbb{E}_{(x,y_t) \sim \mathcal{T}}\left[D(G(x))\right]$$ We treat the discriminator score of the student response, $D(G(x))$, as a scalar reward and use policy gradient to optimize the generator, with GRPO as the RL algorithm.
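For intuition, the sketch below shows one way the discriminator score $D(G(x))$ can drive a GRPO-style update: rewards are normalized within each group of sampled responses, and the resulting sequence-level advantage is broadcast over response tokens in a clipped policy-gradient loss. The group size, clipping threshold, and omission of any KL term are simplifications on our part, not the exact training recipe.

```python
import torch


def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-normalized advantages from discriminator rewards D(G(x)).

    rewards: (num_prompts, group_size), one score per sampled response.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)


def grpo_policy_loss(logp_new: torch.Tensor,
                     logp_old: torch.Tensor,
                     response_mask: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped policy-gradient loss with one sequence-level advantage per response.

    logp_new / logp_old: (batch, seq_len) token log-probs of the sampled responses.
    response_mask:       (batch, seq_len), 1 for response tokens, 0 for prompt/padding.
    advantages:          (batch,) flattened output of grpo_advantages.
    """
    ratio = (logp_new - logp_old).exp()
    adv = advantages.unsqueeze(-1)                               # broadcast over tokens
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    per_token = -torch.min(unclipped, clipped) * response_mask
    return per_token.sum() / response_mask.sum()
```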
The discriminator is trained to minimize the loss: $$ \min_D \ \ \mathbb{E}_{(x,y_t) \sim \mathcal{T}} \left[-\log \sigma\left( D(y_t) - D(G(x)) \right)\right]$$ which is the Bradley-Terry loss for learning the preference pair $y_t \succ G(x)$.
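In code, this update is just a pairwise ranking loss over the two discriminator scores (a minimal sketch; batching details are omitted):

```python
import torch
import torch.nn.functional as F


def bradley_terry_loss(d_teacher: torch.Tensor, d_student: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(D(y_t) - D(G(x))): teach D to prefer y_t over G(x).

    d_teacher: (batch,) discriminator scores for teacher responses y_t.
    d_student: (batch,) discriminator scores for on-policy student responses G(x).
    """
    return -F.logsigmoid(d_teacher - d_student).mean()
```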
We use a two-stage training scheme to warm up the generator and discriminator before GAD training: the generator is trained with cross-entropy loss on teacher responses, while the discriminator is simultaneously trained with the Bradley-Terry loss.
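A rough sketch of one warmup step under these assumptions; the batch keys, how the student responses for the discriminator pairs are sampled, and the joint batching are illustrative choices, not the exact recipe.

```python
import torch.nn.functional as F


def warmup_step(generator, discriminator, batch):
    """One joint warmup step; `batch` is assumed to hold tokenized teacher sequences
    (prompt tokens masked out of the labels) and sampled student sequences."""
    # Generator warmup: cross-entropy on teacher responses (SeqKD-style SFT).
    sft_loss = generator(input_ids=batch["teacher_input_ids"],
                         attention_mask=batch["teacher_attention_mask"],
                         labels=batch["teacher_labels"]).loss

    # Discriminator warmup: Bradley-Terry loss on (teacher, student) response pairs.
    d_teacher = discriminator(batch["teacher_input_ids"], batch["teacher_attention_mask"])
    d_student = discriminator(batch["student_input_ids"], batch["student_attention_mask"])
    bt_loss = -F.logsigmoid(d_teacher - d_student).mean()

    return sft_loss, bt_loss
```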
Reward

Figure 3: Off-policy discriminator suffers from reward hacking, whereas on-policy discriminator remains stable over thousands of training steps.

From a reinforcement learning perspective, we reinterpret the generator as a policy model and the discriminator as a reward model. The generator produces responses, receives rewards from the discriminator, and is optimized to maximize the expected reward. The discriminator is trained with the Bradley-Terry loss on preference pairs to score the teacher response higher, similar to the reward model in RLHF. While conventional RLHF trains a fixed reward model prior to policy optimization, which makes it prone to reward hacking (Figure 3), our approach updates the reward model (discriminator) online so that it continually adapts to the current policy, making training more robust and stable.

Experiments

We use a subset of the LMSYS-Chat-1M dataset to train the student models. The dataset is derived from high-quality conversational data collected via the Chatbot Arena platform. We adopt GPT-5-Chat as the teacher model; it is a closed-source chat model ranked ninth on the Chatbot Arena text leaderboard at the time of writing. For student models, we use the instruction-tuned variants of open-source models from the Qwen2.5 family (Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct, Qwen2.5-14B-Instruct) and the Llama 3 family (Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct).
We report the results of automatic evaluation using GPT-4o scores in Figure 1. We compare GAD with the instruct model before distillation and the SeqKD baseline. Across all datasets, GAD consistently outperforms the baselines. On the LMSYS-Chat test set, Qwen2.5-3B-Instruct trained with GAD matches the performance of Qwen2.5-7B-Instruct trained with SeqKD; similarly, Qwen2.5-7B-Instruct with GAD rivals Qwen2.5-14B-Instruct with SeqKD, and Qwen2.5-14B-Instruct with GAD is comparable to the GPT-5-Chat teacher.
In addition, GAD shows particularly strong gains in out-of-distribution generalization. On Dolly, SelfInst, and Vicuna, SeqKD yields marginal or even negative improvements, whereas GAD maintains robust performance gains. We attribute this to the superior generalization ability of reinforcement learning compared to supervised fine-tuning.
User Study

Figure 4: Human evaluation results on the LMSYS-Chat test set.

We also provide human evaluation results in Figure 4. GAD achieves a win rate exceeding 50% and a loss rate below 30% in almost all comparisons, indicating that GAD consistently outperforms the baseline models under human evaluation. For more results and analysis, please refer to our paper.

BibTeX

@article{ye2025blackboxonpolicydistillationlarge,
  title={Black-Box On-Policy Distillation of Large Language Models},
  author={Tianzhu Ye and Li Dong and Zewen Chi and Xun Wu and Shaohan Huang and Furu Wei},
  journal={arXiv preprint arXiv:2511.10643},
  year={2025},
  url={https://arxiv.org/abs/2511.10643}
}