Self-distillation on math: intro

Recently I read an interesting paper: Embarrassingly Simple Self-Distillation Improves Code Generation.

The thesis is as follows:

Domain choice

When I read this, I thought that if this is truly the case, then it must also apply to domains outside of coding which nevertheless have a similar lock/fork dichotomy. Which domains would apply?

I decided to try out competitive math, in part because it feels like more of a departure from the original paper (whereas tool calling is really just coding in disguise), and more practically because it's somewhat easier to stratify by difficulty, sample and verify: one prompt and response, no execution environment needed. Beyond that, there are a couple encouraging signals that math might work:

Model

Now that we have the goal and the domain settled, it's time to get into the details. Since I'm just a guy who has to pay for his own compute, I can't run an extensive suite of models and a heavy hparam grid search. So I decided to pick a single model that is not too large. Based on some simple rules of thumb (8 to 16 bytes VRAM per parameter before activations, very long sequence lengths), I decided to go with Qwen3.5-4B. The 4B scale is at a sweet spot of "small enough to run SFT on without needing a cluster of B200s, big enough to have decent performance". There are a couple smaller-scale variants, (0.8B, 2B), but the Qwen team specifically called out that those were prone to thinking loops, so I decided to only use those for smoke tests. Another decision characteristic was that 4B was the smallest scale used in the SSD paper, so I didn't want to go lower than that. We're already doing something quite different here with the new setting and new model family.

Tasks

Beyond just "competitive math", we can also talk a bit about what the tasks actually look like. Here's an example from BeyondAIME, which is one of the eval sets we'll use:

Problem:
Choose a subset of integers from 1 to 15 such that
no three consecutive integers are in the chosen set
(you can choose an empty set). How many different
sets are there?

Answer: 10609

Typically, step-by-step golden solutions include reasoning with LaTeX or LaTeX-like markup; a reasonable prompt template for the model is thus something like this:

Solve the following competitive math problem.
Reason step by step and put your final answer in \boxed{}.

The \boxed instruction is there to make it easier for us to extract an answer. We also need to normalize answers to allow the model to answer e.g. 0.75, 3/4, or \frac{3}{4}. Luckily this is already handled by libraries like math-verify.

Conclusion

So that's the goal for this project: see if there is an SSD-like effect on competitive math when running it on Qwen3.5-4B, and see if it does result in improved performance. In the next post I'll talk about what the data for this project will look like.