o1 - Are the curves awesome?

Statistical Debunking

OpenAI's blog post announcing the o1 model contained charts showing the training-time and test-time performance of the new model.

OpenAI's blog-post chart

Clearly, the new o1 model is groundbreaking - and I'll be talking about how it represents a new direction in which compute can be scaled towards better performance at the September 2024 Machine Learning Singapore MeetUp. However, people seem to be over-interpreting the graphs provided, which (IMHO) do not demonstrate the breakthrough in test-time performance being attributed to the o1 innovation: there's a simpler explanation.

I tweeted about the issue, and attached a Colab notebook demonstrating how people are also not parsing OpenAI's messaging carefully (combined with OpenAI being in no hurry to correct their misconceptions):

Tweet on 2024-09-14

What does the test-time curve say?

There are two levels to the o1 model compute usage:

  • the variable-length 'reasoning' that the model does within a given 'rollout' - it seems the model will think for longer on harder tasks. Although the blog post only shows a limited number of these rollouts (which include some backtracking / rethinking / etc.), the model is clearly doing something interesting and new here (compared to conventional LLMs).
  • the number of different rollouts allowed for a given problem. This is what feeds into the final test score: multiple rollouts are combined in some way (e.g. voting, ranking) - see the sketch below.

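To make the second level concrete, here is a minimal sketch of one naive combination scheme - majority voting over final answers. OpenAI hasn't disclosed which scheme o1 actually uses, and `combine_rollouts` is a hypothetical helper for illustration only:

```python
from collections import Counter

def combine_rollouts(answers: list[str]) -> str:
    """Naive majority vote: return the most common final answer
    across the rollouts for a single problem."""
    return Counter(answers).most_common(1)[0][0]

# Five rollouts of the same problem, three agreeing on '42'
print(combine_rollouts(["42", "41", "42", "7", "42"]))  # -> 42
```
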
While OpenAI seems keen to emphasise how they've achieved 'log-linear' accuracy improvements, that is almost an admission that the multiple-rollouts phase is a failure: a log-linear curve is what one gets if one simply combines multiple rollouts in a naive way! That is the idea the Colab I've made demonstrates.

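To see why, suppose each rollout solves a given question independently with probability p, and the question counts as solved if any of the k rollouts succeeds (a pass@k-style assumption; OpenAI hasn't said exactly how rollouts are scored). Then the expected accuracy is:

```latex
% Accuracy of k independent rollouts, each succeeding with probability p,
% scoring the question as solved if any rollout succeeds:
\mathrm{acc}(k) = 1 - (1 - p)^{k} \approx 1 - e^{-pk}
```

Plotted against log k, this is a sigmoid with its midpoint near k ≈ ln 2 / p: it only looks linear over a limited stretch. Averaging many such sigmoids (one per question, each with its own p) stretches that linear-looking region out - which is exactly what the notebook below demonstrates.
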
The Simple Colab (Linked here)

The notebook is in two parts:

  • First: a simple illustration of the accuracy that multiple rollouts would achieve if the probability of success for any given rollout were a constant p=0.001.
    The graph this produces actually 'bends upwards' - a surprise, since it seems that even this naive model beats OpenAI's claims. However, playing around with p shows that (under this binomial model of answer success) accuracy is just a sigmoidal function of compute.
  • Second: a more realistic scenario where there are many different questions in the test set, each with its own value of p for the corresponding rollouts. Because of the averaging across the test set, the per-question curves blend together, and curves like the ones OpenAI is showing appear. A minimal reproduction of both parts is sketched after this list.

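For readers who don't want to open the Colab, here is a self-contained sketch of both parts. The log-uniform spread of per-question success rates in part two is an assumption for illustration - the real distribution across any benchmark is unknown:

```python
import numpy as np
import matplotlib.pyplot as plt

ks = np.logspace(0, 5, 200)               # rollout counts: 1 .. 100,000

# Part 1: a single question with constant per-rollout success p = 0.001.
# Accuracy of 'solved by at least one of k rollouts' is 1 - (1-p)^k,
# which is sigmoidal when plotted against log(k).
p = 0.001
acc_single = 1 - (1 - p) ** ks

# Part 2: a whole test set, each question with its own p.  The log-uniform
# spread here is an assumed distribution, purely for illustration.
rng = np.random.default_rng(0)
ps = 10 ** rng.uniform(-5, -1, size=500)  # per-question success rates
acc_testset = (1 - (1 - ps[None, :]) ** ks[:, None]).mean(axis=1)

plt.semilogx(ks, acc_single, label="one question, p=0.001")
plt.semilogx(ks, acc_testset, label="averaged over 500 questions")
plt.xlabel("rollouts per question (test-time compute, log scale)")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```

The single-question curve is clearly sigmoidal on the log axis, while the test-set average comes out close to a straight line over several orders of magnitude of compute - the same shape OpenAI's chart shows.
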
Conclusions

The above demonstrates that the curves OpenAI is using to back their test-time compute claims could also be a simple consequence of sampling from a model that randomly spits out answers, each with some (even small) probability of being correct. In other words: log-linear performance is nothing to shout about.

Naturally, OpenAI should be proud of their o1 model for increasing the value of p for lots of problems: that's a huge achievement.

But please don't look at OpenAI's test-time compute graph and hope that when you combine the results of many model runs you'll achieve a log-linear shape! That would actually be a demonstration of poor performance...

Follow-up