Self-distillation on math: postscript

For this post, I want to focus less on the scientific and experimental side of things, and more on the engineering and operational aspects. I ran this project in my free time and on my own budget, which led me to some intersting situations that I want to look at in more detail.

The cost of getting compute

I have a quite beefy local setup (my HuggingFace profile tells me I'm solidly in "GPU rich" territory), but even that wasn't enough for the experiments I wanted to run. Therefore, I did not do any training locally beyond pre-flight checks before running on rented compute. I wanted to minimize the amount of time wasted on setup so I tested the infra with the 0.8B scale model which did fit. I was very concerned about getting everything to work under a single H100 (80 GiB VRAM) to simplify the setup and cost. This is one of the main reasons I went with a small scale of 4B. I learned about and applied many small tricks to reduce the VRAM usage during training, in particular:

Of these, not all were equally successful. Two of them were even harmful:

Basically, I got a bit carried away with the optimization in order to make sure I could fit under my self-imposed VRAM target. I made it work, but that became irrelevant anyway because...

Compute bottlenecks

This is something that surprised me: sampling (either training data or responses for evals) took something like 80% of all the GPU time I paid for. Once I saw this, I decided I should optimize my sampler as well, but there's only so much one guy can do to make vLLM faster. So I focused on another axis: just scaling up compute. At first I did simple data parallelism, but then I figured out the obvious thing: I was running very short prompts with very long generations, so entirely decode-bound. This meant I didn't want more compute in general, I wanted more bandwidth. Which meant I had to consider...

Tradeoff between price and performance

Each on-demand GPU cloud solution has a variety of options to offer, where the price is supposed to align with the performance. My question was: is it worth it to pay for the more expensive GPUs because they will finish the job faster? As it turned out, for sampling specifically, the total price of a job was very similar across different GPUs with different hourly rates (I guess this makes sense in an efficient-markety sort of way, assuming compute is roughly fungible). So it was not a question of how much I had to pay for a job, but whether I wanted the results sooner than later. That's a no-brainer, so I switched from H100s to B200s, which incidentally have so much VRAM that I wouldn't need to worry at all about usage!

Moral of the story

Overall, this is a classic case of premature optimization. I feel like I was justified by wanting to reduce costs, but it's true that I did not realize what the true bottleneck was at the start and I made some assumptions that did not help much. Instead I should have aimed for a stable setup first, with proven good training dynamics, then measured what was taking time in that stable setup.

Operations

I set up my workspace so that I just had to clone my repo on the remote machine and sync the environment to get started. But at some point I also wanted the AI to take over monitoring of experiments (to catch issues like: I made a typo and reserved 84 GiB of disk instead of 384, so training will crash two hours in when I'm asleep). So I had it write a small server that it could connect to using my API key. This server exposed commands that mirrored the CLI tool I was using to run experiments locally: check available configs and hardware, run stages one by one or all at once, checking job progress and logs, etc. I thought I was being clever and making things easy, but a couple times the server went down for some reason and the agent merrily carried on by using runpodctl and SSH (I had previously given it an SSH key for this purpose) to control the remote, with no apparent issues. So the server idea was cool, but perhaps unnecessary. Overall, maybe setting up a Docker image would have saved more time, but I never got around to it.

Themes

I would say the main thing I would change on this project would be: less overengineering and more focus on quality. Even though I ran experiments reasonably fast (given my spotty schedule), I got sidetracked a couple times on these optimization sidequests that seemed important at the time but didn't matter much in the end. I would have gotten results faster by focusing on the experimental side:

  1. Start running an experiment.
  2. While it runs, think of what to do given the possible results.
  3. Only if you have time here, think of improving the infra, cost, performance, etc.

Or simply ABE: Always Be Experimenting.