On GenAI's Impact on the world as we know it

Measuring AI Ability to Complete Long Tasks
Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear. To quantify the capabilities of AI systems in terms of human capabilities, we propose a new metric: 50%-task-completion time horizon. This is the time humans typically take to complete tasks that AI models can complete with 50% success rate. We first timed humans with relevant domain expertise on a combination of RE-Bench, HCAST, and 66 novel shorter tasks. On these tasks, current frontier AI models such as Claude 3.7 Sonnet have a 50% time horizon of around 50 minutes. Furthermore, frontier AI time horizon has been doubling approximately every seven months since 2019, though the trend may have accelerated in 2024. The increase in AI models’ time horizons seems to be primarily driven by greater reliability and ability to adapt to mistakes, combined with better logical reasoning and tool use capabilities. We discuss the limitations of our results -- including their degree of external validity -- and the implications of increased autonomy for dangerous capabilities. If these results generalize to real-world software tasks, extrapolation of this trend predicts that within 5 years, AI systems will be capable of automating many software tasks that currently take humans a month.

Written in 2025:

On these tasks, current frontier AI models such as Claude 3.7 Sonnet have a 50% time horizon of around 50 minutes [...]
If these results generalize to real-world software tasks, extrapolation of this trend predicts that within 5 years, AI systems will be capable of automating many software tasks that currently take humans a month.
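
As a rough sanity check on that extrapolation, here is a minimal sketch that just compounds the two numbers quoted above (a 50-minute horizon, doubling every seven months). The 167-hour working month and the function names are my own assumptions, not something from the paper.

```python
import math

# Figures quoted from the abstract above
CURRENT_HORIZON_MIN = 50   # 50% time horizon today, in minutes
DOUBLING_MONTHS = 7        # doubling period, in months

# Assumption (not from the paper): one working month is about 167 hours
WORK_HOURS_PER_MONTH = 167


def projected_horizon_hours(months_ahead):
    """Time horizon in hours after `months_ahead` months of steady doubling."""
    doublings = months_ahead / DOUBLING_MONTHS
    return CURRENT_HORIZON_MIN * 2 ** doublings / 60


def months_until_horizon(target_hours):
    """Months of steady doubling needed to reach a horizon of `target_hours`."""
    growth_needed = target_hours * 60 / CURRENT_HORIZON_MIN
    return math.log2(growth_needed) * DOUBLING_MONTHS


hours_in_5_years = projected_horizon_hours(5 * 12)
months_to_one_month = months_until_horizon(WORK_HOURS_PER_MONTH)

print(f"Horizon in 5 years: {hours_in_5_years:.0f} hours "
      f"({hours_in_5_years / WORK_HOURS_PER_MONTH:.1f} working months)")
print(f"Months until a one-working-month horizon: {months_to_one_month:.0f}")
```

Under these assumptions the horizon lands at roughly 300 hours after five years, a bit under two working months, and it crosses the one-month mark after about four and a half years. That is consistent with the "within 5 years" claim, provided the trend actually holds.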

The implications of this are huge. A month's worth of human work is worth more than $10,000. You'll be able to get that for a few hundred bucks of tokens. It really raises the question of what effect this would have on society...

Well, then I propose we develop a framework to compare this with other technological revolutions in human history! Don't expect intense mathematical rigour or thoroughly researched numbers here. We're working off of intuition. But intuition is generally good enough to guess order-of-magnitude differences between things.

A revolution's impact is probably correlated with the magnitude of its efficiency gains [1], its breadth (how many people it affects) and its stakes (how much it affects them). Some examples, to set the scale (think of it as a Richter scale for technological impact):

  • Someone invents the fortune-cookie folding machine. 50x gains in efficiency, but it applies in very specific conditions and the stakes are pretty low. <1 on the Richter scale.
  • The barcode scanner. Revolutionized inventory management, made modern businesses possible. 4-5 on the Richter scale.
  • Tractors replacing horses for farm work. Cuts costs 2-10x on a core activity for humanity (the eating part, not specifically farming). 8-9 on the Richter scale. World-shaking.
  • The printing press. Makes information 90x cheaper to disseminate. Information is the backbone of today's society, so I'd say the breadth*stakes product is pretty high. 8-9 on the Richter scale.

Conservatively, my guess for GenAI is a 7-8 (it's at least as impactful as the invention of the Google search engine), and most likely it lands somewhere in the 7-9 range. It'll really depend on how many of the predictions in the article above materialize, on adoption, and on the actual impact on society...


[1] The important part is not that it reduces cost, it's that it makes the resource that many times more available to humanity. And due to the Jevons paradox, as something gets cheaper, we usually consume more and more of it, making it even more important to humanity.