Self-distillation on math: probing

Last time we trained an SSD model on competitive math, but with mixed results: the trained model significantly beats the baseline at high eval temperature, but is not stronger than simple temperature tuning. At T_eval = 0.7, the trained model is directionally better, but not by a large enough margin. Today, we're going to look into the behaviors of the two models and see if we can find an SSD-like effect on lock and fork tokens, or if the gains are mainly attributable to "regular" training effects (i.e. shifting the model's distribution towards correct choices).

The probing mechanism

The probing mechanism works as follows:

You can see examples of the output of this probing below.

With this kind of information, we can decide if the trained model indeed exhibits an SSD effect or not.

Choice of models

In order to make my life easier, I decided to compare the trained model and the baseline at T_eval = 1.5. If you recall from the previous post, at this temperature, the trained model is unambiguously better than the baseline. Therefore, we can assume that there is indeed some kind of effect from the training procedure, and we're trying to distinguish whether it's caused by SSD mechanics or by something else.

Analysis

Initial statistics

The initial stats reveal the following:

These stats don't tell us enough as they are, but they seem to signal "output distribution shift" rather than "SSD effect". Let's look at each case in more detail.

Lock sharpening

SSD predicts that locks will sharpen: at positions where the context admits only one acceptable token, that token gets strengthened while the so-called distractor tail gets compressed away. Looking at some samples, we can already see that the heuristics don't tell the whole story. Here's an example of what I call a "true" lock: both models agree on the top token, but the trained model assigns it a higher probability, reducing the chance of sampling the wrong thing. In this case, the top-1 probability statistic moves in the direction we expect.

  So I am very confident in the answer $-0.5$.
  The final answer should be presented inside `\boxed{}`.
  So `\boxed{-1/2}`.

Rank  Base token (prob)   SSD token (prob)
1     { (0.5268)          { (0.79383)
2     {- (0.4649)         {- (0.20071)
3     `. (0.0075144)      `. (0.0041657)
4     {} (0.00037412)     {} (0.00082027)
5     ` (0.00012146)      {}. (0.0002074)
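
To make the top-1 comparison concrete, here is a minimal Python sketch that recomputes it from the top-5 rows of the table above. The dictionary layout and the helper name `top1` are illustrative assumptions, not the probing code used for this post; only the probabilities come from the table.

```python
# Top-5 rows copied from the table above (token -> probability).
base = {"{": 0.5268, "{-": 0.4649, "`.": 0.0075144, "{}": 0.00037412, "`": 0.00012146}
ssd = {"{": 0.79383, "{-": 0.20071, "`.": 0.0041657, "{}": 0.00082027, "{}.": 0.0002074}


def top1(dist):
    """Return (token, probability) for the most likely token (hypothetical helper)."""
    token = max(dist, key=dist.get)
    return token, dist[token]


base_tok, base_p = top1(base)
ssd_tok, ssd_p = top1(ssd)

# A "true" lock in the sense used here: same top token, higher probability after training.
same_top = base_tok == ssd_tok
top1_delta = ssd_p - base_p

print(f"same top token: {same_top}, top-1 delta: {top1_delta:+.5f}")
```

On these rows both models pick `{` as the top token, and the trained model's top-1 probability is higher by roughly 0.267.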

In contrast, here's an example of a token that heuristics classify as a lock, but that could be more accurately described as a failure from the baseline:

Final Answer is \boxed{C}.
Wait, box format is \boxed{answer}. The question asks to put the final answer within boxed.
The answer is an option letter.

Rank  Base token (prob)   SSD token (prob)
1     Wait (0.62804)      </think> (0.43923)
2     </think> (0.15879)  Wait (0.38762)
3     Or (0.075009)       Or (0.059443)
4     I (0.045495)        The (0.052458)
5     The (0.045495)      I (0.036054)

Here, training did the right thing: it commits to outputting an answer rather than reinforcing the self-doubt / self-correction mechanism. This shows that not all tokens tagged as locks are true locks. We can further decompose our lock heuristic into the following subcategories:

When computing aggregate stats over these categories, we see a bit of an encouraging sign: strict locks now show a positive top-1 delta (indicating the trained model tends to sharpen them), whereas swaps and self-corrections contribute negative top-1 delta (indicating that the trained model disagrees with and pushes down the baseline's choice).
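
As a rough illustration of this kind of per-category aggregation, here is a small Python sketch run on the two lock examples shown earlier. The classification rules, the `SELF_CORRECTION_TOKENS` set, and the function names are assumptions made for the sketch and may differ from the heuristics actually used; the delta is taken to be the change in probability of the baseline's top token.

```python
from collections import defaultdict
from statistics import mean

# Assumption: tokens treated as the start of a self-correction.
SELF_CORRECTION_TOKENS = {"Wait"}


def classify_lock(base_top, ssd_top):
    """Assign a lock position to a subcategory (illustrative rules only)."""
    if base_top == ssd_top:
        return "strict_lock"
    if base_top in SELF_CORRECTION_TOKENS:
        return "self_correction"
    return "swap"


def top1_delta(base, ssd):
    """Change in probability of the baseline's top token.

    Tokens missing from the listed SSD rows are counted as probability 0.
    """
    base_top = max(base, key=base.get)
    return ssd.get(base_top, 0.0) - base[base_top]


# Top-5 rows of the two lock examples shown earlier (token -> probability).
positions = [
    (
        {"{": 0.5268, "{-": 0.4649, "`.": 0.0075144, "{}": 0.00037412, "`": 0.00012146},
        {"{": 0.79383, "{-": 0.20071, "`.": 0.0041657, "{}": 0.00082027, "{}.": 0.0002074},
    ),
    (
        {"Wait": 0.62804, "</think>": 0.15879, "Or": 0.075009, "I": 0.045495, "The": 0.045495},
        {"</think>": 0.43923, "Wait": 0.38762, "Or": 0.059443, "The": 0.052458, "I": 0.036054},
    ),
]

deltas_by_category = defaultdict(list)
for base, ssd in positions:
    category = classify_lock(max(base, key=base.get), max(ssd, key=ssd.get))
    deltas_by_category[category].append(top1_delta(base, ssd))

mean_delta = {cat: mean(ds) for cat, ds in deltas_by_category.items()}
for cat, value in mean_delta.items():
    print(f"{cat}: mean top-1 delta {value:+.5f}")
```

With only these two positions, the strict lock has a positive delta and the self-correction a negative one, matching the direction of the aggregate stats. The swap category stays empty because none of the examples shown falls into it.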

Fork diversity preservation

SSD predicts that forks are preserved even as locks are sharpened. The overall stats largely bear this out (although the mean nucleus size does shrink from 12 to 10 tokens). However, we also observe cases where a position flagged by the heuristic is not a true fork. Here's an example similar to the Wait lock above:

Is $(2,2,3)$ a valid assignment for $(a,b,c)$? Yes.
Okay, proceed with this solution.
</think>

The problem asks [...]

Rank  Base token (prob)   SSD token (prob)
1     </think> (0.2174)   </think> (0.40585)
2     I (0.19185)         I (0.14931)
3     The (0.13186)       If (0.11628)
4     If (0.13186)        The (0.10262)
5     Also (0.070579)     Also (0.06224)

This "fork" is simply the baseline model failing to commit to an answer, and the trained model moving to assign more weight to the correct token. In contrast, here's a fork that is more typical:

  So, we get the identity:
  $$a^2 + b^2 + c^2 = 0$$

3.  **Look for Alternatives:**
   We need to find $S = a+b+c$.
  Let's denote the equations as:

Rank  Base token (prob)     SSD token (prob)
1     Let (0.22876)         Let (0.31029)
2     Subtract (0.12245)    We (0.11415)
3     Sometimes (0.095362)  Sometimes (0.10074)
4     The (0.084157)        The (0.088901)
5     We (0.084157)         Subtract (0.078455)

This is a fairly representative example of how true forks behave under our SSD recipe: they preserve meaningful alternatives, but they also sharpen somewhat, and tail mass moves into the top-20.
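
For readers who want to check this on the table above, here is a minimal Python sketch of two summary quantities: nucleus size and top-k mass. The helper names and the threshold `P = 0.5` are assumptions chosen for illustration (only the top-5 rows are available here, so a larger threshold could not be reached); they are not necessarily the settings behind the aggregate stats.

```python
def nucleus_size(probs, p):
    """Smallest number of tokens, taken in descending probability order,
    whose cumulative mass reaches p. Returns None if the listed rows never reach p."""
    total = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        total += prob
        if total >= p:
            return count
    return None


def top_k_mass(probs, k):
    """Total probability carried by the k most likely tokens."""
    return sum(sorted(probs, reverse=True)[:k])


# Top-5 probabilities from the fork table above, in rank order.
base = [0.22876, 0.12245, 0.095362, 0.084157, 0.084157]
ssd = [0.31029, 0.11415, 0.10074, 0.088901, 0.078455]

P = 0.5  # assumption: illustrative threshold, not the one used for the aggregate stats

base_nucleus = nucleus_size(base, P)
ssd_nucleus = nucleus_size(ssd, P)
base_top5 = top_k_mass(base, 5)
ssd_top5 = top_k_mass(ssd, 5)

print(f"nucleus size at p={P}: base {base_nucleus}, SSD {ssd_nucleus}")
print(f"top-5 mass: base {base_top5:.4f}, SSD {ssd_top5:.4f}")
```

On these rows the 0.5-nucleus shrinks from 4 tokens to 3 while the top-5 mass grows from about 0.61 to about 0.69: the head sharpens a little, yet several alternatives keep comparable probability.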

Summary of effects

Our probing exercise reveals an interesting structure. We certainly see some examples of tokens that exhibit the behavior predicted by SSD. At the same time, we also see many examples where this is not the case, and the movement in output probabilities is more adequately explained by generic training.

In the next post, we'll discuss those results in the context of the larger training effort and see if we achieved our goal and what we learned.