o1 - Are the curves awesome?

Statistical Debunking

OpenAI's blog post announcing the o1 model contained charts showing the training-time and test-time performance of the new model.

OpenAI's blog-post chart

Clearly, the new o1 model is groundbreaking - and I'll be talking about how it represents a new direction in which compute can be scaled towards better performance at the September 2024 Machine Learning Singapore MeetUp. However, people seem to be over-interpreting the graphs provided, which (IMHO) do not demonstrate the breakthrough in test-time performance being attributed to the o1 innovation: there's a simpler explanation.

I tweeted about the issue, and attached a Colab notebook demonstrating how people are also not parsing OpenAI's messaging carefully (combined with OpenAI being in no hurry to correct their misconceptions):

Tweet on 2024-09-14

What does the test-time curve say?

There are two levels to the o1 model compute usage:

  • the variable-length 'reasoning' that the model does within a given 'rollout' - it seems the model will think for longer on harder tasks. Although the blog post only shows a limited number of these rollouts (which include some backtracking / rethinking / etc.), the model is clearly doing something interesting and new here (compared to conventional LLMs).
  • the number of different rollouts allowed for a given problem. This is what feeds into the final test score: multiple rollouts are combined in some way (e.g. voting, ranking) - see the sketch below.

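To make the second level concrete, here is a minimal sketch of one naive combination scheme - majority voting over final answers. OpenAI hasn't disclosed which scheme o1 actually uses, and `combine_rollouts` is a hypothetical helper for illustration only:

```python
from collections import Counter

def combine_rollouts(answers: list[str]) -> str:
    """Naive majority vote: return the most common final answer
    across the rollouts for a single problem."""
    return Counter(answers).most_common(1)[0][0]

# Five rollouts of the same problem, three agreeing on '42'
print(combine_rollouts(["42", "41", "42", "7", "42"]))  # -> 42
```
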
While OpenAI seems keen to emphasise how they've achieved 'log-linear' accuracy improvements, that is almost an admission that the multiple-rollouts phase is a failure: a log-linear curve is what one gets if one simply combines multiple rollouts in a naive way! That is the idea the Colab I've made demonstrates.

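To see why, suppose each rollout solves a given question independently with probability p, and the question counts as solved if any of the k rollouts succeeds (a pass@k-style assumption; OpenAI hasn't said exactly how rollouts are scored). Then the expected accuracy is:

```latex
% Accuracy of k independent rollouts, each succeeding with probability p,
% scoring the question as solved if any rollout succeeds:
\mathrm{acc}(k) = 1 - (1 - p)^{k} \approx 1 - e^{-pk}
```

Plotted against log k, this is a sigmoid with its midpoint near k ≈ ln 2 / p: it only looks linear over a limited stretch. Averaging many such sigmoids (one per question, each with its own p) stretches that linear-looking region out - which is exactly what the notebook below demonstrates.
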
The Simple Colab (Linked here)

The notebook is in two parts:

  • First: a simple illustration of the accuracy that multiple rollouts would achieve if the probability of success for any given rollout were a constant p=0.001.
    The graph this produces actually 'bends upwards' - a surprise, since it seems that even this naive model beats OpenAI's claims. However, playing around with p shows that (under this binomial model of answer success) accuracy is just a sigmoidal function of compute.
  • Second: a more realistic scenario where there are many different questions in the test set, each with its own value of p for the corresponding rollouts. Because of the averaging across the test set, the per-question curves blend together, and curves like the ones OpenAI is showing appear. A minimal reproduction of both parts is sketched after this list.

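For readers who don't want to open the Colab, here is a self-contained sketch of both parts. The log-uniform spread of per-question success rates in part two is an assumption for illustration - the real distribution across any benchmark is unknown:

```python
import numpy as np
import matplotlib.pyplot as plt

ks = np.logspace(0, 5, 200)               # rollout counts: 1 .. 100,000

# Part 1: a single question with constant per-rollout success p = 0.001.
# Accuracy of 'solved by at least one of k rollouts' is 1 - (1-p)^k,
# which is sigmoidal when plotted against log(k).
p = 0.001
acc_single = 1 - (1 - p) ** ks

# Part 2: a whole test set, each question with its own p.  The log-uniform
# spread here is an assumed distribution, purely for illustration.
rng = np.random.default_rng(0)
ps = 10 ** rng.uniform(-5, -1, size=500)  # per-question success rates
acc_testset = (1 - (1 - ps[None, :]) ** ks[:, None]).mean(axis=1)

plt.semilogx(ks, acc_single, label="one question, p=0.001")
plt.semilogx(ks, acc_testset, label="averaged over 500 questions")
plt.xlabel("rollouts per question (test-time compute, log scale)")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```

The single-question curve is clearly sigmoidal on the log axis, while the test-set average comes out close to a straight line over several orders of magnitude of compute - the same shape OpenAI's chart shows.
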
Conclusions

The above demonstrates that the curves OpenAI is using to back their test-time compute claims could also be a simple consequence of sampling from a model that randomly spits out answers, each with some (even small) probability of being correct. In other words: log-linear performance is nothing to shout about.

Naturally, OpenAI should be proud of their o1 model for increasing the value of p for lots of problems: that's a huge achievement.

But please don't look at OpenAI's test-time compute graph and hope that when you combine the results of many model runs you'll achieve a log-linear shape! That would actually be a demonstration of poor performance...

Follow-up