The Bitter Lesson: Computational Scalability Conquers Time

Short Note

General methods that leverage computation are ultimately the most effective, and by a large margin.

A model = Computation + Human prior knowledge

  • Computation cost drops exponentially, following Moore’s law.
  • Human labour cost stays roughly constant.

The more time one spends on leveraging computation, the less time is left for building in human knowledge, so a tradeoff must be made.

Within a short time window, all projects share hardware of roughly the same level, so leveraging more human knowledge looks like the only promising way to boost performance. However, over a span only slightly longer than a typical research project, massively more computation inevitably becomes available.

  • Researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation.
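The short-window versus long-run argument above can be made concrete with a toy calculation. This is only a sketch: the two-year doubling period is an assumed, commonly quoted reading of Moore's law, and the function name `compute_multiplier` is invented for illustration.

```python
def compute_multiplier(years: float, doubling_period_years: float = 2.0) -> float:
    """Factor by which compute per unit cost grows after `years`.

    Assumption: compute per unit cost doubles every `doubling_period_years`
    (2.0 is an illustrative value, not a figure from the note).
    """
    return 2.0 ** (years / doubling_period_years)


# Within a single short project, hardware barely changes.
within_project = compute_multiplier(1.0)

# Over one or two decades, the same budget buys vastly more computation.
after_decade = compute_multiplier(10.0)
after_two_decades = compute_multiplier(20.0)

print(f"1 year:   x{within_project:.2f}")
print(f"10 years: x{after_decade:.0f}")
print(f"20 years: x{after_two_decades:.0f}")
```

Under these assumptions, a one-year project sees well under a 2x change in available compute, while a decade brings a 32x change and two decades roughly a 1000x change. That gap is why hand-built knowledge looks attractive inside one project and loses out later.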

Insight from Rich Sutton

What we see in history

  1. AI researchers have often tried to build knowledge into their agents,
  2. This always helps in the short term, and is personally satisfying to the researcher, but
  3. In the long run it plateaus and even inhibits further progress
  4. Breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.

What we should pay attention to:

  1. What we should learn from the bitter lesson is the great power of general-purpose methods: methods that continue to scale with increased computation even as the available computation becomes very great.
    • The two methods that seem to scale arbitrarily in this way are search and learning.
  2. The world and the mind are too complex to model in detail. Their contents are not what should be built in, as their complexity is endless; instead we should build in only the meta-methods that can find and capture this arbitrary complexity.


However, human knowledge in a domain accumulates and can improve future projects, whereas computation already spent does not. A better model would therefore be that computation cost drops exponentially while human knowledge accumulates linearly. This refinement does not affect any of the arguments the author makes, because exponential growth eventually overtakes linear growth.

Lessons from 70 Years of AI Research

All quoted from the original blog post.

Computer Chess/Go

“In computer chess, the methods that defeated the world champion, Kasparov, in 1997, were based on massive, deep search. At the time, this was looked upon with dismay by the majority of computer-chess researchers who had pursued methods that leveraged human understanding of the special structure of chess. When a simpler, search-based approach with special hardware and software proved vastly more effective, these human-knowledge-based chess researchers were not good losers. They said that ‘brute force’ search may have won this time, but it was not a general strategy, and anyway it was not how people played chess. These researchers wanted methods based on human input to win and were disappointed when they did not.”

“A similar pattern of research progress was seen in computer Go, only delayed by a further 20 years. Enormous initial efforts went into avoiding search by taking advantage of human knowledge, or of the special features of the game, but all those efforts proved irrelevant, or worse, once search was applied effectively at scale. Also important was the use of learning by self play to learn a value function (as it was in many other games and even in chess, although learning did not play a big role in the 1997 program that first beat a world champion). Learning by self play, and learning in general, is like search in that it enables massive computation to be brought to bear. Search and learning are the two most important classes of techniques for utilizing massive amounts of computation in AI research. In computer Go, as in computer chess, researchers' initial effort was directed towards utilizing human understanding (so that less search was needed) and only much later was much greater success had by embracing search and learning.”

Speech Recognition

“In speech recognition, there was an early competition, sponsored by DARPA, in the 1970s. Entrants included a host of special methods that took advantage of human knowledge—knowledge of words, of phonemes, of the human vocal tract, etc. On the other side were newer methods that were more statistical in nature and did much more computation, based on hidden Markov models (HMMs). Again, the statistical methods won out over the human-knowledge-based methods. This led to a major change in all of natural language processing, gradually over decades, where statistics and computation came to dominate the field. The recent rise of deep learning in speech recognition is the most recent step in this consistent direction. Deep learning methods rely even less on human knowledge, and use even more computation, together with learning on huge training sets, to produce dramatically better speech recognition systems. As in the games, researchers always tried to make systems that worked the way the researchers thought their own minds worked—they tried to put that knowledge in their systems—but it proved ultimately counterproductive, and a colossal waste of researcher’s time, when, through Moore’s law, massive computation became available and a means was found to put it to good use.”

Computer Vision

“In computer vision, there has been a similar pattern. Early methods conceived of vision as searching for edges, or generalized cylinders, or in terms of SIFT features. But today all this is discarded. Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances, and perform much better.”