Self-distillation on math: data

Last time we introduced the project of porting the SSD paper to the competitive math domain. Let's make this more concrete, starting with the data we want to use. Recall that the goal is to train Qwen3.5-4B on its own truncated + temperature-shifted reasoning traces, crucially without any correctness verification.

Data characteristics

Since we want to have an easy time evaluating the model's answers, we should focus on math problems with a specific final answer, as opposed to proofs, where the goal is to show that a particular statement holds and which thus require a more involved step-by-step evaluation.

Another consideration is the difficulty of the problems we want to target; competitive math can be relatively broad. Here, there are two factors that speak for evaluating harder problems:

The SSD authors observed a larger lift on hard problems in terms of pass@k, so harder problems are more interesting.
Qwen3.5-4B already does relatively well on competitive math (75% on recent HMMT according to the model card). Therefore we should look at problems at least as hard as those from HMMT.

We also care about data quality, size, and contamination. There are a lot of datasets targeting competitive math. Some sets are very high quality and verified (e.g. AIME, HMMT), but also small. Conversely, some sets like NuminaMath are huge, but automatically scraped and not as clean. For evals, we should also avoid “obvious” benchmarks like, again, AIME/HMMT that get reported often and thus are way more likely to end up in the model's training data. This suggests the following allocations:

For training, we actually don't care about quality too much since we don't train on the golden answer but rather self-distill. Nevertheless, something like MATH is useful as it comes with difficulty tags. Having some kind of golden answer is also useful, as we'll see in the next post when we talk about the training recipe. So we want large sets, and don't care too much about quality or contamination. Here, the main ones I settled on were levels 4 and 5 of MATH as well as NuminaMath-CoT, which scrapes various competitions.
For evals, we want verified, hard, and ideally uncontaminated sets. This necessarily means that they are likely to be recent, and small. For example, BeyondAIME was published in June 2025, so it is less likely to be in Qwen3.5's training data (published in Feb 2026) than e.g. MATH (from 2021).

Eval budgets

We also need to select how much data to use. For training, the natural choice is to follow the SSD paper: it used about 10k examples for thinking models. For eval, we can use a classic power analysis to determine how much data we should be using:

$$ n \approx \frac{2\sigma^2(z_{\alpha/2} + z_{\beta})^{2}}{\delta^{2}} $$

This formula tells you how many samples you need in order to detect a desired effect size between two models. It's based on:

$\sigma^2$: The model's variance on the eval data. The original formula adds the variances of a base and a test model ($\sigma_1^2 + \sigma_2^2$); here we can assume they are roughly equal, which gives the $2\sigma^2$ term, since the training effect reported in the original paper is small.
$\delta$: The effect size we want to reliably detect (e.g. $0.05$ for a 5 p.p. difference in accuracy); the smaller this is, the more samples we will need.
z-scores for $\alpha$ and $\beta$, our acceptable false positive and false negative rates; typically 5% and 20% respectively (i.e. 80% power).
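To make the formula concrete, here is a minimal sketch of it in Python using only the standard library. The function name `samples_needed` and the example variance are my own illustrative choices: the variance of 0.25 is the worst case for a 0/1 accuracy score (p = 0.5), not the value measured in this project.

```python
from math import ceil
from statistics import NormalDist


def samples_needed(variance: float, delta: float,
                   alpha: float = 0.05, beta: float = 0.20) -> int:
    """Per-model sample size for a two-sided test, assuming both models
    share the same variance on the eval data."""
    z = NormalDist()  # standard normal
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = z.inv_cdf(1 - beta)        # quantile matching power = 1 - beta
    return ceil(2 * variance * (z_alpha + z_beta) ** 2 / delta ** 2)


# Illustrative only: worst-case Bernoulli variance p * (1 - p) at p = 0.5,
# and a 5 p.p. effect size.
n = samples_needed(variance=0.25, delta=0.05)
print(n)
```

Halving `delta` quadruples the required sample size, which is why small effects get expensive quickly.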

The power analysis presents a bit of a chicken-and-egg issue for us: we have very little validation data, and we need to keep some of it for the final eval, so we need to estimate variance first without being able to draw from a large population. What I did to solve this was to select a few different slices to assemble a roughly 1k-sized validation set:

BeyondAIME, mentioned before already: relatively recent, novel, and hard. Only 100 rows.
OmniMath-2: filtered and high quality, but based on the somewhat less-recent OmniMath, and not as hard as BeyondAIME.
The English text-only math subset of OlympiadBench: same domain, but slightly less difficult and also not recent.
AIME archives: used mainly as a backfill to get to 1k rows; entirely saturated by larger models, more likely to be contaminated.

Then given this set, you can run an initial eval to get variance estimates. This is also a good opportunity to do a bit of variance decomposition. We have a limited amount of available eval prompts, so we can sample each prompt multiple times to get multiple answers, and thus reduce variance (see Miller (2024) for the classic practical guide on this). The catch: this only reduces within-prompt variance, i.e. the natural stochasticity of the model when answering the prompt; it does not reduce between-prompt variance, i.e. the difference in difficulty between each prompt. If the within component is very small, then sampling multiple times per prompt will not tighten your CIs much compared to spending the same inference budget on more distinct prompts.
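As a rough sketch of what this decomposition looks like in practice, the snippet below splits the variance of a grid of 0/1 scores into between-prompt and within-prompt parts via the law of total variance. The function names and the toy scores are made up for illustration, and it assumes every prompt was sampled the same number of times. Note that with few samples per prompt, the naive between-prompt estimate is inflated by some leftover sampling noise.

```python
from statistics import fmean, pvariance


def decompose_variance(scores: list[list[float]]) -> tuple[float, float]:
    """Return (between-prompt, within-prompt) variance.

    scores[i][j] is the 0/1 correctness of sample j for prompt i.
    """
    prompt_means = [fmean(row) for row in scores]
    between = pvariance(prompt_means)                   # spread of per-prompt accuracy
    within = fmean([pvariance(row) for row in scores])  # average per-prompt noise
    return between, within


def variance_of_mean(between: float, within: float,
                     n_prompts: int, k: int) -> float:
    """Variance of the overall accuracy with n_prompts prompts, k samples each."""
    return (between + within / k) / n_prompts


# Toy data (made up): 4 prompts, 4 samples each.
scores = [
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]
between, within = decompose_variance(scores)
share_between = between / (between + within)
print(f"between-prompt share: {share_between:.0%}")
```

In `variance_of_mean`, only the within term is divided by `k`, so when the between-prompt share dominates, raising `k` barely moves the result while raising `n_prompts` shrinks it directly.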

In my case, I found, using the recommended decode settings of Qwen3.5-4B, that the variance was almost entirely (more than 85%) explained by the between-prompt component. The conclusion is that we should expand the validation set. However, we also have a pragmatic decision to make: how long we want evals to take. Given the variance numbers I observed, if we want to detect an effect of about 5 p.p. (roughly in the ballpark of what the SSD authors observed), we need about 2k prompts. Even when setting k=1 (which we can feel safe doing, given the variance decomposition), the relatively long contexts we have to contend with (I'll get into this in the next post) mean that evaluating a single checkpoint would take around three hours. That is not practical for runs where we want to track progress across checkpoints during training, or for sweeps where we evaluate many final models. Therefore, I decided to:

Keep the 1k budget for now, and deploy it on strong candidates.
Have a smaller version at 300 rows to get a quick initial read on a particular checkpoint's performance.
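To get a feel for what each budget buys, the power formula can be inverted to give the smallest effect a set of size n can reliably detect. This is a sketch under stated assumptions: `min_detectable_effect` is a hypothetical helper, and the variance of 0.25 is the worst case for a 0/1 score rather than the variance I measured.

```python
from math import sqrt
from statistics import NormalDist


def min_detectable_effect(variance: float, n: int,
                          alpha: float = 0.05, beta: float = 0.20) -> float:
    """Smallest accuracy difference detectable with n prompts per model,
    obtained by solving the sample-size formula for delta."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(1 - beta)
    return z_sum * sqrt(2 * variance / n)


# Illustrative worst-case variance; swap in a measured value if available.
for n in (300, 1000, 2000):
    mde = min_detectable_effect(variance=0.25, n=n)
    print(f"n={n}: ~{100 * mde:.1f} p.p.")
```

The detectable effect only shrinks with the square root of n, so the small set is useful for spotting large regressions or gains quickly, not for resolving differences of a few points.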

Contamination

As I mentioned earlier, another important aspect here is contamination detection. Since we're assembling the training and eval sets ourselves, it's entirely possible that the same or similar problems will end up in both. Beyond exact match, the classic way to detect this is fuzzy n-gram overlap / Jaccard similarity. We also have an extra trick here to improve precision and recall at the same time: check whether the answers of the two dupe candidates are the same. If the answers match, we can flag dupes at a lower similarity threshold than if they don't.
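Here is a minimal sketch of that idea, assuming each row is a dict with `problem` and `answer` string fields. The field names, the word-trigram choice, the plain string comparison of answers, and both thresholds are illustrative assumptions rather than the values used in this project; real answers would likely need normalization first.

```python
import re


def ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Word n-grams of a lowercased problem statement, punctuation dropped."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection over size of the union (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_duplicate(train_row: dict, eval_row: dict,
                 strict: float = 0.8, relaxed: float = 0.5) -> bool:
    """Flag a train/eval pair, with a lower bar when the answers agree."""
    sim = jaccard(ngrams(train_row["problem"]), ngrams(eval_row["problem"]))
    same_answer = train_row["answer"].strip() == eval_row["answer"].strip()
    return sim >= (relaxed if same_answer else strict)


# Made-up pair: a light paraphrase with the same answer.
a = {"problem": "Find the number of positive integers n less than 100 "
                "such that n is divisible by 7.", "answer": "14"}
b = {"problem": "Determine the number of positive integers n less than 100 "
                "such that n is a multiple of 7.", "answer": "14"}
print(is_duplicate(a, b))
```

Pairwise comparison like this is quadratic in the number of rows, so for a 10k-by-1k check it is fine, but larger sets would call for something like MinHash to find candidates first.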

Conclusion

That sums up the main aspects of data selection:

10k training prompts, a scale that follows the SSD methodology
A diverse validation set of 1k rows, with smaller mixtures for intermediate checks
Evaluation amounts and strategy selected through variance decomposition and power analysis

I'm intentionally skipping over the data used for the final eval for now; we'll get back to it once we start looking at trained models.

In the next post we'll talk about baselines and initial diagnostics.