Self-distillation on math: intro
Recently I read an interesting paper: Embarrassingly Simple Self-Distillation Improves Code Generation.
The thesis is as follows:
- In the code generation domain, certain tokens are "lock" positions (where the next token is obvious from context) and some are "fork" positions (where a given number of options would be roughly equally likely). The example the paper gives is writing a sorting function: when starting the function there are genuinely many different approaches to take (quicksort, insertion sort, mergesort…) but once an approach has been selected, the sequence of statements and tokens is locked in.
- Lock positions are hindered by increasing temperature: the long tail of irrelevant tokens grows heavier, increasing the chances of sampling a token that does not belong in the context.
- Fork positions are hindered by lowering temperature: it sharpens the distribution and forces more mass onto an arbitrary token from the previously flat head, increasing the chances of discarding useful alternatives.
- In contrast, SSD manages to help both forks and locks, by training a model on a truncated (through top-k-top-p), temperature-shifted version of its original distribution. The paper claims that this training works by preserving useful forks (by increasing temperature at training data sampling time) while reinforcing locks (by aggressively cutting off the tail).
- Crucially, this works without a correctness signal. You don't need to check if the solution was correct, or even whether a solution was produced. This is why it's "embarrassingly simple": all you need to do is sample some completions and train on those.
Domain choice
When I read this, I thought that if this is truly the case, then it must also apply to domains outside of coding which nevertheless have a similar lock/fork dichotomy. Which domains would apply?
- General tool calling and agentic workflows are the closest candidate, very close to coding conceptually. The fork is in deciding which tool to use in response to the user and what the previous tool calls have returned. Then once you commit to an approach, the lock is in coming up with a syntactically valid tool call.
- Competitive math also seems to have a similar structure: a problem statement, choosing an approach / lemma / theorem to apply (forks), setting up and carrying out the calculations or proof steps (lock), and arriving at a final answer.
I decided to try out competitive math, in part because it feels like more of a departure from the original paper (whereas tool calling is really just coding in disguise), and more practically because it's somewhat easier to stratify by difficulty, sample and verify: one prompt and response, no execution environment needed. Beyond that, there are a couple encouraging signals that math might work:
- The SSD paper mentions that after training on code, they verified and found that math performance did not degrade.
- The recent Deep-Thinking Tokens paper looked at several domains and found that on math benchmarks like AIME and HMMT, a structure became apparent where certain "syntactic" tokens had shallow thinking, and more semantically important tokens deeper thinking. Without going so far as to say "deep think = fork" and "shallow think = lock", I feel like there's an analogy there.
Model
Now that we have the goal and the domain settled, it's time to get into the details. Since I'm just a guy who has to pay for his own compute, I can't run an extensive suite of models and a heavy hparam grid search. So I decided to pick a single model that is not too large. Based on some simple rules of thumb (8 to 16 bytes VRAM per parameter before activations, very long sequence lengths), I decided to go with Qwen3.5-4B. The 4B scale is at a sweet spot of "small enough to run SFT on without needing a cluster of B200s, big enough to have decent performance". There are a couple smaller-scale variants, (0.8B, 2B), but the Qwen team specifically called out that those were prone to thinking loops, so I decided to only use those for smoke tests. Another decision characteristic was that 4B was the smallest scale used in the SSD paper, so I didn't want to go lower than that. We're already doing something quite different here with the new setting and new model family.
Tasks
Beyond just "competitive math", we can also talk a bit about what the tasks actually look like. Here's an example from BeyondAIME, which is one of the eval sets we'll use:
Problem:
Choose a subset of integers from 1 to 15 such that
no three consecutive integers are in the chosen set
(you can choose an empty set). How many different
sets are there?
Answer: 10609
Typically, step-by-step golden solutions include reasoning with LaTeX or LaTeX-like markup; a reasonable prompt template for the model is thus something like this:
Solve the following competitive math problem.
Reason step by step and put your final answer in \boxed{}.
The \boxed instruction is there to make it easier for us to extract an answer. We also need to normalize answers to allow the model to answer e.g. 0.75, 3/4, or \frac{3}{4}. Luckily this is already handled by libraries like math-verify.
Conclusion
So that's the goal for this project: see if there is an SSD-like effect on competitive math when running it on Qwen3.5-4B, and see if it does result in improved performance. In the next post I'll talk about what the data for this project will look like.