
“By comparing the student’s predictions against the next-token suggestions made by the teacher, we produce an on-policy reward signal that enables the student to quickly improve the quality of its multi-token predictions,” they added.
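The quoted idea can be illustrated with a minimal sketch. This is not the paper's actual reward function; it simply assumes the reward is the fraction of draft positions where the student's multi-token prediction agrees with the teacher's greedy next-token choice (function name and token IDs are hypothetical):

```python
# Hypothetical reward: fraction of draft positions where the student's token
# matches the teacher's next-token suggestion at that position.
def agreement_reward(student_draft, teacher_tokens):
    assert len(student_draft) == len(teacher_tokens)
    matches = sum(s == t for s, t in zip(student_draft, teacher_tokens))
    return matches / len(student_draft)

# Example: the student proposes 4 tokens; the teacher agrees on the first 3.
print(agreement_reward([11, 42, 7, 99], [11, 42, 7, 100]))  # 0.75
```

Because the reward is computed on the student's own samples, the signal is on-policy: the student is graded on the drafts it actually produces, not on teacher-forced sequences.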
At inference time, the system uses a confidence-adaptive (ConfAdapt) decoding strategy that dynamically determines how many tokens to emit per pass. When the model is highly confident, it outputs larger chunks. When uncertainty rises, it falls back to smaller steps, preserving accuracy while maintaining speed gains.
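A rough sketch of how such an adaptive acceptance rule might look, assuming per-token confidence scores and a fixed threshold (both the threshold value and the function are illustrative, not the paper's implementation):

```python
# Hypothetical confidence-adaptive acceptance: take draft tokens left to right
# while confidence stays above a cutoff, always emitting at least one token so
# decoding keeps making progress.
def accept_tokens(draft_tokens, confidences, threshold=0.9):
    accepted = []
    for tok, conf in zip(draft_tokens, confidences):
        if conf < threshold and accepted:
            break  # uncertainty rose: end the chunk before this token
        accepted.append(tok)
        if conf < threshold:
            break  # low-confidence first token: emit it alone
    return accepted

print(accept_tokens(["the", "cat", "sat"], [0.97, 0.95, 0.60]))  # ['the', 'cat']
print(accept_tokens(["the", "cat", "sat"], [0.50, 0.95, 0.60]))  # ['the']
```

When the model is confident across the whole draft, the full chunk is emitted in one pass; when confidence drops, the step size shrinks automatically, which is where the accuracy-preserving behavior comes from.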
In experiments on the GSM8K math reasoning benchmark, an 8B-parameter model achieved more than 3x acceleration with less than a 3 percent drop in accuracy. A smaller 4B-parameter model reached similar speedups, though with a larger 7 percent accuracy drop. More aggressive configurations pushed acceleration to 5x, at steeper accuracy costs.

