Self-distillation on math: data

Last time we introduced the project of porting the SSD paper to the competitive math domain. Let's make this more concrete, starting with the data we want to use. Recall that the goal is to train a Qwen3.5-4B on its own truncated + temperature-shifted reasoning traces, crucially without any correctness verification.

Data characteristics

Since we want evaluating the model's answers to be easy, we should focus on math problems with a specific final answer, as opposed to proofs, where the goal is to show that a particular statement holds and which thus require a more involved step-by-step evaluation.

Another consideration is the difficulty of the problems we want to target; competitive math can be relatively broad. Here, there are two factors that argue for evaluating harder problems:

We also care about data quality, size, and contamination. There are a lot of datasets targeting competitive math. Some sets are very high quality and verified (e.g. AIME, HMMT), but also small. Conversely, some sets like NuminaMath are huge, but automatically scraped and not as clean. For evals, we should also avoid “obvious” benchmarks like, again, AIME/HMMT that get reported often and thus are way more likely to end up in the model's training data. This suggests the following allocations:

Eval budgets

We also need to select how much data to use. For training, the natural choice is to follow the SSD paper: it used about 10k examples for thinking models. For eval, we can use a classic power analysis to determine how much data we should be using:

$$ n \approx \frac{2\sigma^{2}(z_{\alpha/2} + z_{\beta})^{2}}{\delta^{2}} $$
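To get a feel for the numbers, here is a minimal sketch that evaluates the standard two-sample form of this formula, in which the sum of z-scores is squared. The function name, the defaults of α = 0.05 and 80% power, and the worst-case binary variance σ² = 0.25 are assumptions for illustration, not the values measured in this project.

```python
import math
from statistics import NormalDist


def samples_per_model(sigma2, delta, alpha=0.05, power=0.80):
    """Samples needed per model to detect a difference `delta` (two-sided test).

    sigma2: per-sample variance of the score (assumed equal for both models)
    delta:  smallest accuracy difference we want to detect (0.05 = 5 p.p.)
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power = 1 - beta
    return math.ceil(2 * sigma2 * (z_alpha + z_beta) ** 2 / delta ** 2)


# Assumed worst case for a 0/1 correctness score: p = 0.5, so sigma^2 = 0.25.
n = samples_per_model(sigma2=0.25, delta=0.05)
print(n)
```

Halving the target effect size quadruples the required sample count, which is why the choice of δ dominates the eval budget.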

This formula tells you how many samples you need in order to detect a desired effect size between two models. It's based on:

The power analysis presents a bit of a chicken-and-egg issue for us: we have very little validation data, and we need to keep some of it for the final eval, so we need to estimate variance first without being able to draw from a large population. What I did to solve this was to select a few different slices to assemble a roughly 1k-sized validation set:

Then given this set, you can run an initial eval to get variance estimates. This is also a good opportunity to do a bit of variance decomposition. We have a limited amount of available eval prompts, so we can sample each prompt multiple times to get multiple answers, and thus reduce variance (see Miller (2024) for the classic practical guide on this). The catch: this only reduces within-prompt variance, i.e. the natural stochasticity of the model when answering the prompt; it does not reduce between-prompt variance, i.e. the difference in difficulty between each prompt. If the within component is very small, then sampling multiple times per prompt will not tighten your CIs much compared to spending the same inference budget on more distinct prompts.
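As a rough illustration of the decomposition, the sketch below splits the variance of 0/1 correctness scores into a between-prompt and a within-prompt part using a simple moment-based estimator. This is an assumed approach for illustration, not necessarily the procedure used for the numbers in this post; the function names and the toy `scores` matrix are hypothetical.

```python
import math
from statistics import mean, variance


def decompose_variance(scores):
    """scores[i][j] is the score of sample j on prompt i (k >= 2 samples each)."""
    k = len(scores[0])
    prompt_means = [float(mean(row)) for row in scores]
    # Within-prompt: average sample variance across prompts (model stochasticity).
    within = float(mean([variance(row) for row in scores]))
    # Between-prompt: variance of per-prompt means, minus the sampling noise
    # those means still carry (within / k), clipped at zero.
    between = max(variance(prompt_means) - within / k, 0.0)
    return between, within


def standard_error(between, within, n_prompts, k):
    """Standard error of mean accuracy with n_prompts prompts and k samples each."""
    return math.sqrt((between + within / k) / n_prompts)


# Toy data (hypothetical): 4 prompts, 4 samples each.
scores = [[1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 0, 1], [0, 0, 0, 1]]
between, within = decompose_variance(scores)
share_between = between / (between + within)
print(between, within, share_between)
```

In `standard_error`, increasing `k` only shrinks the `within / k` term, while increasing `n_prompts` shrinks both, which is the trade-off described above.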

In my case, I found, using the recommended decode settings of Qwen3.5-4B, that the variance was almost entirely (more than 85%) explained by the between-prompt component. The conclusion is that we should expand the validation set. However, we also have a pragmatic decision to make, which is how long we want evals to take. Given the variance numbers I observed, if we want to detect an effect of about 5 p.p. (which is roughly in the ballpark of what the SSD authors observed), we need about 2k prompts. Even when setting k=1, i.e. a single sample per prompt (which we can feel safe doing, given the variance decomposition), with the relatively long contexts we have to contend with (I'll get into this in the next post) this would mean evaluating a single checkpoint takes around three hours. Not practical for runs where we want to check evolution during training across checkpoints, or sweeps where we evaluate many final models. Therefore, I decided to:

Contamination

As I mentioned earlier, another important aspect here is contamination detection. Since we're assembling the training and eval sets ourselves, it's entirely possible the same or similar problems will end up in both. Beyond exact match, the classic way to detect this is fuzzy n-gram overlap / Jaccard similarity. We also have an extra trick here to improve precision and recall at the same time, by checking whether the answers of the two dupe candidates are the same: if the answers match, we can flag dupes at a lower similarity threshold than if they don't match.
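A minimal sketch of the answer-aware check, assuming word-level trigrams and plain string comparison of answers. The function names and both thresholds (0.5 when answers match, 0.8 when they don't) are hypothetical placeholders, not the values used in this project.

```python
import re


def ngrams(text, n=3):
    """Set of word-level n-grams after lowercasing and stripping punctuation."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def is_probable_dupe(prob_a, ans_a, prob_b, ans_b,
                     same_answer_threshold=0.5, diff_answer_threshold=0.8):
    """Flag a pair as a likely duplicate, with a looser bar if the answers agree."""
    sim = jaccard(ngrams(prob_a), ngrams(prob_b))
    answers_match = ans_a.strip() == ans_b.strip()
    threshold = same_answer_threshold if answers_match else diff_answer_threshold
    return sim >= threshold


p1 = "Find the number of positive integers n less than 100 such that n is divisible by 7."
p2 = "Find the number of positive integers n less than 100 such that n is a multiple of 7."
sim = jaccard(ngrams(p1), ngrams(p2))
print(sim)
print(is_probable_dupe(p1, "14", p2, "14"))  # answers agree: lower threshold applies
print(is_probable_dupe(p1, "14", p2, "13"))  # answers differ: higher threshold applies
```

In practice the answers would need normalizing before comparison (e.g. `1/2` vs `0.5`), and all-pairs comparison over a large training set would call for something like MinHash rather than this quadratic check.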

Conclusion

That sums up the main aspects of data selection:

I'm intentionally skipping over the data used for the final eval for now; we'll get back to it once we start looking at trained models.

In the next post we'll talk about baselines and initial diagnostics.