I'm thrilled to have Jason, an AI researcher based in San Francisco, back with us to share his insights. Jason is currently working at OpenAI and was previously a research scientist at Google Brain, where he popularized key ideas in large language models (LLMs) such as Chain of Thought prompting, instruction tuning, and emergent phenomena.
Jason will be speaking for around 30 minutes, followed by a Q&A session. He will then be joined by Kuan Wang, a research scientist at OpenAI, for another 30 minutes of discussion and a joint Q&A.
Jason will be discussing the fundamental question of why language models work so well. He encourages everyone to use a tool he has found helpful in answering this question: manually inspecting data. To illustrate this, Jason shares an anecdote from 2019 when he was trying to build a lung cancer classifier. Despite being told by his advisor that he needed a medical degree and three years of pathology experience to do the task, Jason read all the papers on classifying different types of lung cancer, consulted with pathologists, and eventually gained the intuition necessary to build the classifier.
Jason then provides a quick review of language models and how they are trained on the next-word prediction task. The language model outputs a probability for every single word in the vocabulary, and the goal of training is to push the probability of the correct next word as close to one as possible. Jason notes that next-word prediction is massively multi-task learning: to predict the next word well, the model has to pick up grammar, lexical semantics, world knowledge, traditional NLP tasks such as sentiment analysis and translation, spatial reasoning, and even some math.
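To make the training objective concrete, here is a minimal sketch of the loss for a single next-word prediction. The toy vocabulary, the scores, and the function names are illustrative assumptions, not anything shown in the talk; real models work over tens of thousands of tokens and average this loss over a very large corpus.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability for every word in the vocabulary."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_word_loss(logits, correct_index):
    """Negative log-probability of the correct next word (cross-entropy)."""
    probs = softmax(logits)
    return -math.log(probs[correct_index])

# Hypothetical toy vocabulary and model scores for one made-up context.
vocab = ["cat", "dog", "mat", "runs"]
logits = [0.5, 0.2, 3.0, -1.0]

probs = softmax(logits)
loss_if_mat = next_word_loss(logits, vocab.index("mat"))    # model favours "mat"
loss_if_runs = next_word_loss(logits, vocab.index("runs"))  # model disfavours "runs"
```

The loss approaches zero as the probability of the correct word approaches one, which is the sense in which training pushes that probability toward one.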
Jason then discusses scaling: the idea that increasing compute reliably improves loss, where compute is roughly the amount of training data multiplied by the size of the language model. This idea was pioneered by Kaplan et al. in 2020 and is known as a scaling law: plotted with compute on the x-axis and loss on the y-axis, the relationship follows a predictable line. Jason notes that the line does not saturate, meaning that spending more compute, for example by training a larger language model, is expected to keep lowering loss.
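As a rough illustration of what a scaling law lets you do, the sketch below fits a straight line in log-log space and extrapolates it to a larger compute budget. It assumes the power-law form loss = a * compute^(-b) that is commonly used for scaling laws; the constants and data points are synthetic values invented for the example, not measurements from the talk or from Kaplan et al.

```python
import math

def fit_power_law(compute, loss):
    """Least-squares fit of log(loss) = log(a) - b * log(compute)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)

def predict_loss(a, b, compute):
    """Loss predicted by the fitted power law at a given compute budget."""
    return a * compute ** (-b)

# Synthetic points generated from a made-up power law (assumption, not real data).
TRUE_A, TRUE_B = 10.0, 0.05
compute = [1e18, 1e19, 1e20, 1e21]
loss = [TRUE_A * c ** (-TRUE_B) for c in compute]

a, b = fit_power_law(compute, loss)
extrapolated = predict_loss(a, b, 1e22)  # ten times more compute than the largest run
```

Because the fitted line keeps sloping downward, the extrapolated loss is lower than any observed loss, which is what "the line does not saturate" means in practice.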
Jason then raises the question of why scaling up the size of a language model improves loss. He admits that we don't have a good answer yet, but he offers two hand-wavy explanations. The first is that larger language models are better at memorizing facts. The second is that smaller language models tend to learn shortcuts, while larger language models have the capacity to do more complicated things to get the next token correct.
The third intuition Jason discusses, after multi-task learning and scaling, is that while overall loss improves smoothly, individual tasks can improve suddenly. This is because the overall loss can be decomposed into the losses of individual tasks, such as grammar, sentiment analysis, world knowledge, and math. Jason notes that these tasks do not all improve at the same rate: some improve gradually, while others improve suddenly once the model reaches a certain scale.
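The decomposition can be sketched with a toy model in which the overall loss is a weighted sum of per-task losses. Everything here is a made-up assumption chosen only to illustrate the argument: the three tasks, their weights, and the shapes of their curves. The point is that a task with a small weight can drop sharply while the total still declines smoothly.

```python
import math

# Hypothetical share of the training signal attributed to each task (sums to 1).
TASK_WEIGHTS = {"grammar": 0.60, "world_knowledge": 0.35, "math": 0.05}

def task_loss(task, log_compute):
    """Made-up per-task loss curves as a function of log10(compute)."""
    if task == "grammar":
        # Improves early, then levels off.
        return 0.5 + 2.0 * math.exp(-log_compute)
    if task == "world_knowledge":
        # Improves steadily with scale.
        return 4.0 - 0.3 * log_compute
    if task == "math":
        # Stays high, then drops sharply around log_compute = 6.
        return 1.0 + 4.0 / (1.0 + math.exp(4.0 * (log_compute - 6.0)))
    raise ValueError(f"unknown task: {task}")

def overall_loss(log_compute):
    """Overall loss as the weighted sum of the individual task losses."""
    return sum(w * task_loss(t, log_compute) for t, w in TASK_WEIGHTS.items())

scales = list(range(1, 9))  # log10(compute) in arbitrary units
overall = [overall_loss(s) for s in scales]
math_only = [task_loss("math", s) for s in scales]

# Step-to-step improvement in the overall loss and in the math task alone.
overall_drops = [a - b for a, b in zip(overall, overall[1:])]
math_drops = [a - b for a, b in zip(math_only, math_only[1:])]
```

In this toy setup the math loss barely moves for the first few scales and then falls by several units over two steps, while no single step in the overall loss is unusually large.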
Jason then looks at 202 tasks from the BIG-Bench dataset and finds that 29% of tasks are smooth, 22% are flat, 2% show inverse scaling, 13% are not correlated with scale, and 33% are emergent abilities. Emergent abilities are defined as tasks where performance is zero for small models and much better than random for large models. Jason notes that this makes them unpredictable: if you had only trained small language models, you would have predicted that the language model could never perform the task.
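One way to turn this definition into a check is sketched below. The function name, the thresholds, and the example scores are hypothetical choices for illustration; the talk does not specify how "zero" or "much better than random" were operationalized, and this sketch treats "near the random baseline" as the small-model condition.

```python
def is_emergent(scores, random_baseline, n_small=2, tolerance=0.02, margin=0.1):
    """Heuristic check for an emergent ability on one task.

    scores: task accuracy ordered from the smallest to the largest model.
    The n_small smallest models must sit within `tolerance` of the random
    baseline, and the largest model must beat it by at least `margin`.
    All thresholds are illustrative assumptions.
    """
    if not 0 < n_small < len(scores):
        raise ValueError("n_small must be between 1 and len(scores) - 1")
    small_near_random = all(s <= random_baseline + tolerance for s in scores[:n_small])
    large_well_above = scores[-1] >= random_baseline + margin
    return small_near_random and large_well_above

# Made-up accuracies on a four-way multiple-choice task (random baseline 0.25).
emergent_scores = [0.25, 0.24, 0.26, 0.25, 0.62]
smooth_scores = [0.30, 0.38, 0.46, 0.55, 0.62]
flat_scores = [0.25, 0.25, 0.26, 0.25, 0.26]

results = {
    "emergent": is_emergent(emergent_scores, 0.25),
    "smooth": is_emergent(smooth_scores, 0.25),
    "flat": is_emergent(flat_scores, 0.25),
}
```

Only the first curve satisfies both conditions: the smooth curve is already above the baseline for small models, and the flat curve never clearly beats it.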
Jason concludes by emphasizing the importance of plotting scaling curves when doing research. By plotting performance against the amount of data or compute used, researchers can judge whether collecting more data, or continuing along their current research direction, is likely to improve performance.
In summary, Jason's talk provides valuable insights into why language models work so well and the importance of scaling and emergent abilities. By manually inspecting data, researchers can gain intuition about tasks and develop more effective language models. Plotting scaling curves can also help researchers make predictions about the future trajectory of language models and determine the most effective research strategies.