
If you pay as much attention to the world of AI as I do, it’s become impossible to ignore the hype surrounding the imminence of AGI (artificial general intelligence). A lot of researchers at the big AI labs are convinced they already know how to “solve” AGI, and that it’s only a matter of time (and not very much time at that).
The dots are all beginning to connect now
We are literally months away!
FEEL THE AGI
— Bindu Reddy (@bindureddy) January 25, 2025
The newest big news comes from China, where DeepSeek has released a model named R1, which can be thought of as an open-source replication of OpenAI’s o1.
🚀 DeepSeek-R1 is here!
⚡ Performance on par with OpenAI-o1
📖 Fully open-source model & technical report
🏆 MIT licensed: Distill & commercialize freely!
🌐 Website & API are live now! Try DeepThink at https://t.co/v1TFy7LHNy today!
🐋 1/n pic.twitter.com/7BlpWAPu6y
— DeepSeek (@deepseek_ai) January 20, 2025
This comes shortly after OpenAI’s announcement of o1’s successor, o3, which achieved new state-of-the-art performance on a couple of very challenging benchmarks.
Today OpenAI announced o3, its next-gen reasoning model. We've worked with OpenAI to test it on ARC-AGI, and we believe it represents a significant breakthrough in getting AI to adapt to novel tasks.
It scores 75.7% on the semi-private eval in low-compute mode (for $20 per task… pic.twitter.com/ESQ9CNVCEA
— François Chollet (@fchollet) December 20, 2024
I haven’t seen o3 yet & have been critical of benchmarks for AI but they did test against some of the hardest & best ones
On GPQA, PhDs with access to the internet got 34% outside their specialty, up to 81% inside. o3 is 87%.
Frontier Math went from the best AI at 2% to 25% pic.twitter.com/tlutPDss6V
— Ethan Mollick (@emollick) December 21, 2024
So, does this mean the hype is real? Is Skynet about to take over the world? Personally, I remain skeptical of most of the hyperbolic claims we see every day.
For starters, consider Goodhart’s Law: when a measure becomes a target, it ceases to be a good measure. As an ML researcher, if you focus on a particular benchmark, you will often find that mastering it is (relatively speaking) easy, but that once you’ve achieved greatness on that benchmark, your model rarely holds up out-of-distribution. So even though OpenAI’s newest “reasoning” models have impressive state-of-the-art results, there’s no guarantee these models would do as well on a completely new dataset of similar difficulty.
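To see the selection effect concretely, here’s a toy simulation (my own sketch, with made-up numbers; it has nothing to do with any real leaderboard): if you crown whichever of many comparable models scores highest on one fixed benchmark, the winner’s score is partly benchmark-specific luck, and it regresses on a fresh benchmark of the same difficulty.

```python
# Toy illustration of Goodhart's Law in benchmark-driven research.
# All numbers are invented; every "model" here has identical true ability.
import random

random.seed(0)

NUM_MODELS = 1000
NOISE = 0.05  # benchmark-specific luck (question mix, prompt quirks, ...)

# Every model has the same underlying ability; only luck differs.
true_ability = [0.70] * NUM_MODELS

def benchmark_score(ability: float) -> float:
    """Observed score = true ability + benchmark-specific noise."""
    return ability + random.gauss(0, NOISE)

# Crown the leaderboard winner on benchmark A...
scores_a = [benchmark_score(a) for a in true_ability]
winner = max(range(NUM_MODELS), key=lambda i: scores_a[i])

# ...then re-evaluate that same model on a fresh benchmark B.
score_b = benchmark_score(true_ability[winner])

print(f"winner on benchmark A:  {scores_a[winner]:.3f}")  # inflated by selection luck
print(f"same model on fresh B:  {score_b:.3f}")           # back near the true 0.70
```

The winner looks far better than it really is on the benchmark it was selected on, then falls back toward its true ability the moment you hand it a fresh test. That’s the out-of-distribution regression in miniature.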
Furthermore, there now appear to have been some shenanigans regarding OpenAI’s achievement on FrontierMath:
This is absolutely wild. OpenAI had access to all of FrontierMath data from the beginning. Anyone who knows ML will tell you don't need to explicitly use the data in your training set (although there is no guarantee of that it did not happen here) to contaminate your model.
I… pic.twitter.com/ZYEbC5e3Nc
— Delip Rao e/σ (@deliprao) January 19, 2025
ASI > AGI
I’ve come around to the view that no benchmark, at least in the way we think about benchmarks today, will ever tell us when we’ve achieved AGI. Smart AI researchers will continue to come up with newer and more challenging benchmarks, and other researchers will continue to master them, even with models that clearly aren’t AGI. Eventually, AGI will be achieved, but no standard benchmark will be able to tell us that it has arrived.
So what are we left with? My belief now is that we will have to produce ASI (artificial super intelligence) before we can actually be sure we’ve reached AGI. Only in retrospect will we be able to know with certainty that someone has produced AGI.
This may seem counterintuitive, as ASI must reach higher levels of intelligence than AGI. But I think that testing for ASI will actually be much easier than testing for AGI. How can we do that?
I’ll borrow an analogy from theoretical computer science: the P versus NP problem.
It’s believed that P != NP, but no one has actually been able to prove this. Problems in NP are, by definition, easy to verify: given a candidate solution, you can check it in polynomial time. If the conjecture is true, though, the hardest of them are intractable to solve from scratch. So, if someone much smarter than you (an ASI, for example) gives you, a dumb human, the answer to such a problem, one you couldn’t solve yourself, you can still verify that the solution is correct.
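Here’s a minimal sketch of that asymmetry using SUBSET-SUM, a classic NP-complete problem (the values are mine, chosen only for illustration): finding a subset that hits the target can take time exponential in the input size, but checking a proposed answer is a single cheap pass.

```python
# Verify/solve asymmetry, illustrated with SUBSET-SUM (NP-complete).
from collections import Counter
from itertools import combinations

def solve(numbers: list[int], target: int) -> tuple[int, ...] | None:
    """Brute-force search: tries up to 2^n subsets in the worst case."""
    for r in range(len(numbers) + 1):
        for subset in combinations(numbers, r):
            if sum(subset) == target:
                return subset
    return None

def verify(certificate: tuple[int, ...], numbers: list[int], target: int) -> bool:
    """Polynomial-time check: certificate is drawn from `numbers` and sums to target."""
    is_subset = not (Counter(certificate) - Counter(numbers))
    return is_subset and sum(certificate) == target

numbers = [3, 34, 4, 12, 5, 2]
answer = solve(numbers, 9)          # slow path: exhaustive search
print(answer)                       # (4, 5)
print(verify(answer, numbers, 9))   # fast path: True, checked in one pass
```

An ASI would play the role of solve(), possibly via methods we can’t follow; we only ever need to run verify().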
So now we just need to define a set of problems we believe are solvable, but which no one, not even the smartest among us, has been able to solve, and whose solutions we can verify once an ASI produces them.
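Formal proofs give us the same deal for mathematics. As a hedged illustration (the theorem below is deliberately trivial, just to show the workflow): if an ASI wrote its proof of an open problem in a proof assistant such as Lean, the kernel would check it mechanically, regardless of whether any human could have found it.

```lean
-- Deliberately trivial: the point is the workflow, not the theorem.
-- Writing the proof term is the hard, creative step; once it exists,
-- Lean's kernel verifies it mechanically. An ASI-generated proof of an
-- open problem would be checked by exactly the same process.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```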
And you might’ve already guessed that P vs NP is one of said problems, probably the one I’d put at the top of the list. Here’s a list of candidate problems I asked ChatGPT to generate.
My best guess is that we’ll be able to solve these problems with ASI within the next 10 years, though 15 is probably a safer bet. Even that’s not certain, though, and I think we’ll see AIs that give the appearance of AGI as part of everyday life even before then, even if they aren’t truly intelligent.