Self-distillation on math: probing
Last time, we trained an SSD model on competitive math, with mixed results: the trained model significantly beats the baseline at high eval temperature, but is not stronger than simple temperature tuning. At T_eval = 0.7, the trained model is directionally better, but the margin is too small to be conclusive. Today, we're going to look into the behaviors of the two models and see if we can find an SSD-like effect on lock and fork tokens, or if the gains are mainly attributable to "regular" training effects (i.e. shifting the model's distribution towards correct choices).
The probing mechanism
The probing mechanism works as follows:
- We select a few "carriers": prompt+response pairs from the data. These can be generated by either the baseline or the trained model. We select enough of them to do some basic stratification on length, correctness status, and generator model.
- For each carrier, we select a number of token positions in the response to compare. Selection strategies include sampling at regular intervals, sampling around heuristically important areas (e.g. formatting tokens, \boxed, declaratory statements like "Let", "Suppose", etc.), and adding a random selection on top.
- For each of these token positions, we compare the output distributions of the baseline and trained models, before and after top-k-top-p truncation.
- We use some heuristics to classify tokens as fork-like or lock-like:
  - Lock-like: high top-1 probability, small nucleus, low tail mass.
  - Fork-like: multiple tokens in the top-5 or top-10 with similar probability, significant nucleus mass with low tail mass (we must look at the tail to distinguish true forks from diffuse cases where the model is genuinely uncertain and assigns low probability across the whole support).
- We record statistic deltas between the two models, like top-1 probability, nucleus size, tail mass, etc.
You can see examples of the output of this probing below.
With this kind of information, we can decide if the trained model indeed exhibits an SSD effect or not.
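The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the actual probing code: the function names, the truncation defaults (`top_k=20`, `top_p=0.95`), and every classification threshold are assumptions picked for the example, and the order of operations (top-k, renormalize, then top-p) is one common convention among several.

```python
import numpy as np

def truncate_top_k_top_p(probs, top_k=20, top_p=0.95):
    """Keep the top_k most likely tokens, renormalize, then keep the
    smallest prefix whose cumulative mass reaches top_p."""
    probs = np.asarray(probs, dtype=np.float64)
    order = np.argsort(-probs)[:top_k]           # indices of the top_k tokens
    kept = probs[order] / probs[order].sum()     # renormalize after top-k
    cum = np.cumsum(kept)
    n = min(int(np.searchsorted(cum, top_p)) + 1, len(kept))
    out = np.zeros_like(probs)
    out[order[:n]] = kept[:n]
    return out / out.sum()

def position_stats(probs, nucleus_p=0.9, tail_rank=10, rival_ratio=0.5):
    """Summary statistics for one next-token distribution."""
    p = np.sort(np.asarray(probs, dtype=np.float64))[::-1]
    cum = np.cumsum(p)
    return {
        "top1": float(p[0]),
        # number of tokens needed to cover nucleus_p of the mass
        "nucleus_size": min(int(np.searchsorted(cum, nucleus_p)) + 1, len(p)),
        # mass assigned to everything below rank tail_rank
        "tail_mass": float(p[tail_rank:].sum()),
        # top-5 tokens (including the top-1) within rival_ratio of the top-1
        "rivals": int((p[:5] >= rival_ratio * p[0]).sum()),
    }

def classify(stats):
    """Hypothetical thresholds; the real heuristics may differ."""
    if stats["top1"] >= 0.5 and stats["nucleus_size"] <= 3 and stats["tail_mass"] <= 0.05:
        return "lock"
    if stats["rivals"] >= 2 and stats["tail_mass"] <= 0.2:
        return "fork"
    return "mixed"  # diffuse or otherwise unclear positions

def stat_deltas(base_probs, trained_probs):
    """Trained-minus-baseline delta for every statistic at one position."""
    base, trained = position_stats(base_probs), position_stats(trained_probs)
    return {k: trained[k] - base[k] for k in base}
```

Checking `tail_mass` in the fork branch is what separates a real fork from a diffuse distribution: a uniform distribution over 100 tokens has five "rivals" in its top-5, but 90% of its mass sits below rank 10, so it lands in `mixed`.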
Choice of models
In order to make my life easier, I decided to compare the trained model and the baseline at T_eval = 1.5. If you recall from the previous post, at this temperature, the trained model is unambiguously better than the baseline. Therefore, we can assume that there is indeed some kind of effect from the training procedure, and we're trying to distinguish whether it's caused by SSD mechanics or by something else.
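As a reminder of why a high eval temperature stresses the tail, here is a small sketch of temperature scaling. The logits are made up for illustration and do not come from either model.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed stably."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max()          # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Hypothetical logits: one clear favorite, one runner-up, and a tail.
toy_logits = np.array([5.0, 3.0, 1.0, 0.0, -1.0, -2.0])

if __name__ == "__main__":
    for t in (0.7, 1.0, 1.5):
        p = softmax_with_temperature(toy_logits, t)
        print(f"T={t}: top-1={p[0]:.3f}, mass outside top-2={p[2:].sum():.3f}")
```

Raising the temperature lowers the top-1 probability and pushes mass into the tail, so a model whose tail is already compressed has more room for error at T_eval = 1.5 than one whose tail is not.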
Analysis
Initial statistics
The initial stats reveal the following:
- Few tokens are considered to be mixed/uncertain; most cleanly fall either into the fork-like or the lock-like heuristic
- The overall top-1 probability delta is slightly negative, even for lock-like tokens
- Tail compression is visible overall
These stats don't tell us enough as they are, but they seem to signal "output distribution shift" rather than "SSD effect". Let's look at each case in more detail.
Lock sharpening
SSD predicts that locks will sharpen: positions where only one token is acceptable from the context will see that token be strengthened, while the so-called distractor tail gets compressed away. Looking at some samples, we can already see that heuristics don't tell the whole story. Here's an example of what I call a "true" lock: both models agree on what the top token is, but the trained model indeed assigns higher probability to that token, reducing the chances that we will sample the wrong thing. In this case, the top-1 probability statistic goes in the direction we expect.
So I am very confident in the answer $-0.5$.
The final answer should be presented inside `\boxed{}`.
So `\boxed{-1/2}`.
| Rank | Base token (prob) | SSD token (prob) |
|---|---|---|
| 1 | { (0.5268) | { (0.79383) |
| 2 | {- (0.4649) | {- (0.20071) |
| 3 | `. (0.0075144) | `. (0.0041657) |
| 4 | {} (0.00037412) | {} (0.00082027) |
| 5 | ` (0.00012146) | {}. (0.0002074) |
In contrast, here's an example of a token that heuristics classify as a lock, but that could be more accurately described as a failure from the baseline:
Final Answer is \boxed{C}.
Wait, box format is \boxed{answer}. The question asks to put the final answer within boxed.
The answer is an option letter.
| Rank | Base token (prob) | SSD token (prob) |
|---|---|---|
| 1 | Wait (0.62804) | </think> (0.43923) |
| 2 | </think> (0.15879) | Wait (0.38762) |
| 3 | Or (0.075009) | Or (0.059443) |
| 4 | I (0.045495) | The (0.052458) |
| 5 | The (0.045495) | I (0.036054) |
Here, training did the right thing: commit to outputting an answer, rather than reinforcing the self-doubt / self-correction mechanism. This shows that not all tokens tagged as locks are true locks. We can further decompose our lock heuristic into the following subcategories:
- True (or strict) locks, where the two models agree on the top token;
- Swaps, where both models output a lock-like distribution but with a different top token;
- Tokens in a self-correction context ("Wait", "Let's check", ...).
When computing aggregate stats over these categories, we see a bit of an encouraging sign: strict locks now show a positive top-1 delta (indicating the trained model tends to sharpen them), whereas swaps and self-corrections contribute negative top-1 delta (indicating that the trained model disagrees with and pushes down the baseline's choice).
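A sketch of how this breakdown could be computed is below. It is illustrative only: the record layout, the marker set (just "Wait" here; a real one would be larger), the rule that self-correction takes precedence over swap, and the choice to measure the delta on the baseline's top token are all assumptions. The two records are the lock examples shown above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical marker set for self-correction contexts.
SELF_CORRECTION_MARKERS = {"Wait"}

def lock_subcategory(base_top, trained_top, markers=SELF_CORRECTION_MARKERS):
    """Assign a lock-like position to one of the three subcategories."""
    if base_top.strip() in markers or trained_top.strip() in markers:
        return "self_correction"
    if base_top == trained_top:
        return "strict"
    return "swap"

def mean_top1_delta_by_category(records):
    """Mean change in the probability of the baseline's top token, per subcategory."""
    buckets = defaultdict(list)
    for r in records:
        cat = lock_subcategory(r["base_top"], r["trained_top"])
        buckets[cat].append(r["trained_p_of_base_top"] - r["base_p"])
    return {cat: mean(deltas) for cat, deltas in buckets.items()}

# The two lock examples from this section.
records = [
    {"base_top": "{", "trained_top": "{",
     "base_p": 0.5268, "trained_p_of_base_top": 0.79383},
    {"base_top": "Wait", "trained_top": "</think>",
     "base_p": 0.62804, "trained_p_of_base_top": 0.38762},
]

if __name__ == "__main__":
    print(mean_top1_delta_by_category(records))
```

On these two records, the strict lock has a positive delta and the self-correction position has a negative one, matching the direction of the aggregate stats.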
Fork diversity preservation
SSD predicts that forks are preserved even as locks are sharpened. The overall stats largely bear this out (although the mean nucleus size does get reduced from 12 to 10 tokens). However, we also observe cases where the positions flagged by the heuristic are not true forks. Here is an example similar to the Wait lock above:
Is $(2,2,3)$ a valid assignment for $(a,b,c)$? Yes. Okay, proceed with this solution. </think> The problem asks [...]
| Rank | Base token (prob) | SSD token (prob) |
|---|---|---|
| 1 | </think> (0.2174) | </think> (0.40585) |
| 2 | I (0.19185) | I (0.14931) |
| 3 | The (0.13186) | If (0.11628) |
| 4 | If (0.13186) | The (0.10262) |
| 5 | Also (0.070579) | Also (0.06224) |
This "fork" is simply the baseline model failing to commit to an answer, and the trained model moving to assign more weight to the correct token. In contrast, here's a fork that is more typical:
So, we get the identity: $$a^2 + b^2 + c^2 = 0$$ 3. **Look for Alternatives:** We need to find $S = a+b+c$. Let's denote the equations as:
| Rank | Base token (prob) | SSD token (prob) |
|---|---|---|
| 1 | Let (0.22876) | Let (0.31029) |
| 2 | Subtract (0.12245) | We (0.11415) |
| 3 | Sometimes (0.095362) | Sometimes (0.10074) |
| 4 | The (0.084157) | The (0.088901) |
| 5 | We (0.084157) | Subtract (0.078455) |
This is a fairly representative example of how true forks behave under our SSD recipe: they preserve meaningful alternatives, but they also sharpen somewhat, and tail mass moves into the top-20.
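To put numbers on this last example, we can sum the top-5 probabilities from the table and measure how spread out the top-5 remains. This is a rough check using only the five values shown (the full distributions are not reproduced here); the entropy of the renormalized top-5 is my own choice of diversity measure, not part of the probing stats.

```python
import math

# Top-5 probabilities copied from the table above.
base_top5 = {"Let": 0.22876, "Subtract": 0.12245, "Sometimes": 0.095362,
             "The": 0.084157, "We": 0.084157}
trained_top5 = {"Let": 0.31029, "We": 0.11415, "Sometimes": 0.10074,
                "The": 0.088901, "Subtract": 0.078455}

def top5_summary(top5):
    """Top-5 mass, mass outside the top-5, and entropy of the renormalized top-5."""
    mass = sum(top5.values())
    renorm = [p / mass for p in top5.values()]
    entropy = -sum(p * math.log(p) for p in renorm)
    return {"top5_mass": mass, "outside_top5": 1.0 - mass,
            "top5_entropy_nats": entropy}

if __name__ == "__main__":
    print("base:   ", top5_summary(base_top5))
    print("trained:", top5_summary(trained_top5))
```

The top-5 mass rises from about 0.615 to about 0.693, so mass is moving up from lower ranks, while the entropy within the top-5 drops only slightly and stays close to the maximum of ln 5 ≈ 1.61 nats.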
Summary of effects
Our probing exercise reveals an interesting structure. We certainly see some examples of tokens that exhibit the behavior predicted by SSD. At the same time, we also see many examples where this is not the case, and the movement in output probabilities is better explained by generic training.
In the next post, we'll discuss those results in the context of the larger training effort and see if we achieved our goal and what we learned.