Joe Mosby's blog

Notes on compute-optimal training of large language models

Image of a chinchilla in a zoo

Computing is power-intensive. There's no getting around it: the computing industry has a hand in warming the planet. Manufacturing computers emits carbon, the supply chain that moves components around the world emits carbon, and the machines themselves give off waste heat, which has to be cooled away, which requires still more energy and more emissions. If we're going to put a halt to climate change, efficient computing is going to be one of the ways we do it.

When I worked in the high-performance computing industry, a constant goal was finding ways to gain computing efficiency. Sometimes that meant getting under the hood and replacing generic binaries with ones tailored to specific hardware. Sometimes it meant changing the hardware itself to make it less resource-intensive. I gained an appreciation for the work: it required thinking deeply about what you actually wanted to do with a computer, then changing everything to suit that workload.

So this paper on compute-optimal training of large language models catches my interest. LLMs like ChatGPT and Claude use hundreds of billions of parameters, and they are resource-hungry to train and run. This paper from DeepMind looks for a better way to build them.

Introduction

For a model with 500 billion parameters, it is predictable how much compute training will require. Even at Amazon/Meta/Google scale, it's only feasible to train these huge models once, so accurately forecasting the right hyperparameters up front is incredibly important. The authors call out prior research showing a power-law relationship between the number of parameters in a language model and its performance, so it makes sense that researchers keep training larger and larger models.

Chart of Chinchilla parameters and performance

The authors seek to do more with less. With Chinchilla, they use fewer parameters (70B compared to GPT-3's 175B) and get comparable or better results. In this work, they break down the question: given a fixed FLOPs (compute) budget, how should one trade off model size against the number of training tokens?

It's useful here to take a quick detour into some LLM-specific terminology. Tokens represent words, parts of words, or short phrases - exactly how text is split depends on the choice of tokenizer. A sentence containing the word "redshift" might keep it as a single token ["redshift"] or break it into ["red", "shift"], depending on the tokenizer's vocabulary. The parameters of the model (roughly) represent different ideas about the language's grammar, structure, and flow of text. An example parameter might be that "redshift" is preceded by "Amazon" X% of the time, and refers to an increase in the wavelength of light Y% of the time. The more parameters, the more complexity of language captured. The model size refers to the number of parameters in the model.
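To make the tokenization step concrete, here's a toy greedy tokenizer. The vocabulary and the longest-match rule are invented for illustration; real tokenizers (BPE, WordPiece, SentencePiece) learn subword vocabularies from data and behave more subtly.

```python
# Toy tokenizer: greedily match the longest vocabulary entry at each position.
# The vocabulary below is made up purely to illustrate the "redshift" example.
VOCAB = {"redshift", "red", "shift", "amazon", "the", " "}

def tokenize(text, vocab=VOCAB):
    tokens, i = [], 0
    text = text.lower()
    while i < len(text):
        for j in range(len(text), i, -1):       # longest match first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])              # unknown character -> its own token
            i += 1
    return tokens

print(tokenize("Amazon redshift"))  # ['amazon', ' ', 'redshift']
# If "redshift" were missing from the vocabulary, the same sentence would
# come out as ['amazon', ' ', 'red', 'shift'] instead.
```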

The authors go about their business by anchoring on a particular FLOPs budget, then tweaking the number of parameters and training tokens to find the combination that achieves the lowest loss within that budget. They predict that the target model will use fewer parameters and many more training tokens (1.4 trillion!) to hit its goals.
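Framed the way the paper frames it, this is a constrained optimization: for a compute budget $C$, find the parameter count $N$ and token count $D$ that minimize the final pre-training loss $L(N, D)$ subject to the compute constraint.

$$ N_{\mathrm{opt}}(C),\; D_{\mathrm{opt}}(C) \;=\; \operatorname*{arg\,min}_{N,\,D \ \text{ s.t. }\ \mathrm{FLOPs}(N, D) = C} L(N, D) $$

A commonly used approximation for dense transformers is $\mathrm{FLOPs}(N, D) \approx 6ND$, which is what makes the trade-off between parameters and tokens so direct: more of one means less of the other for the same budget.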

Estimation

The authors take three different approaches to predicting their model's performance, all of which yield generally similar results.

Approach 1: Fix the model size, adjust the training tokens

In this approach, the authors train a range of fixed model sizes, and for each one they vary the number of training tokens (and therefore the compute spent). For every FLOPs budget they then take the run with the lowest loss, which maps each amount of compute onto an optimal model size and token count across all the training runs.
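A minimal sketch of that frontier-and-fit procedure might look like the following; the run data is invented and the exact fitting choices are mine, not the paper's.

```python
import numpy as np

# Pretend experiment log: one entry per training run,
# recording (compute in FLOPs, final loss, parameters, tokens seen).
runs = [
    (1e18, 3.10, 4.0e7, 4.2e9),
    (1e18, 3.25, 1.0e8, 1.7e9),
    (1e19, 2.80, 1.0e8, 1.7e10),
    (1e19, 2.95, 4.0e8, 4.2e9),
    (1e20, 2.55, 4.0e8, 4.2e10),
    (1e20, 2.70, 1.6e9, 1.0e10),
]

# For each FLOPs budget, keep only the lowest-loss run: the efficient frontier.
frontier = {}
for flops, loss, params, tokens in runs:
    if flops not in frontier or loss < frontier[flops][0]:
        frontier[flops] = (loss, params, tokens)

budgets = np.array(sorted(frontier))
best_params = np.array([frontier[c][1] for c in budgets])
best_tokens = np.array([frontier[c][2] for c in budgets])

# Fit power laws N_opt ∝ C^a and D_opt ∝ C^b as straight lines in log-log space.
a, _ = np.polyfit(np.log10(budgets), np.log10(best_params), 1)
b, _ = np.polyfit(np.log10(budgets), np.log10(best_tokens), 1)
print(f"N_opt scales roughly as C^{a:.2f}, D_opt as C^{b:.2f}")
```

The paper's version of this fit is what produces the "scale both roughly equally" conclusion, with both exponents coming out near 0.5.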

Performance analysis charts from Chinchilla training

Approach 2: Fix the FLOPs budget, adjust the parameters

In this approach, the authors lock the FLOPs budget and vary the model size within it, with the number of training tokens set to whatever that budget allows for each size. It's similar to the first approach, but instead of fixing the model size, it asks which split between parameters and tokens delivers the lowest loss for a given amount of compute.
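Here's a small sketch of that idea for a single budget. The paper fits a parabola to each of these "IsoFLOP profiles" (loss as a function of model size at fixed compute) and reads the optimum off the minimum; the numbers below are invented, and the FLOPs ≈ 6·N·D conversion is the usual approximation rather than anything taken from their code.

```python
import numpy as np

# One IsoFLOP profile: several model sizes all trained to the same compute budget,
# with token counts chosen so that 6 * params * tokens ≈ budget. Invented numbers.
budget = 1e20
params = np.array([2.5e8, 5.0e8, 1.0e9, 2.0e9, 4.0e9])
losses = np.array([2.74, 2.62, 2.58, 2.63, 2.75])

# Fit a parabola to loss vs. log(model size) and take its vertex as the optimum.
c2, c1, c0 = np.polyfit(np.log10(params), losses, 2)
best_log_params = -c1 / (2 * c2)
best_params = 10 ** best_log_params
best_tokens = budget / (6 * best_params)

print(f"Budget {budget:.0e} FLOPs -> ~{best_params:.2e} params, ~{best_tokens:.2e} tokens")
```

Repeating this for many budgets gives another set of (compute, optimal size) points to fit power laws through, just as in Approach 1.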

Performance analysis charts from Chinchilla training

Approach 3: Fit a parametric loss function

In this approach, the authors take the experimental runs performed in Approaches 1 and 2 and model the final training loss as a parametric function of model parameter count and token count. They identify three critical terms:
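The functional form they fit is a sum of those three terms:

$$ \hat{L}(N, D) \;=\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}} $$

Here $E$ is the irreducible loss of natural text itself (what an ideal model with unlimited parameters and data would still incur), $A / N^{\alpha}$ is the penalty for having only $N$ parameters, and $B / D^{\beta}$ is the penalty for training on only $D$ tokens. The constants $A$, $B$, $\alpha$, and $\beta$ are fitted to the experimental runs.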

Optimal model scaling

All three approaches show that model size and the amount of training data should be scaled in roughly equal proportions. Using the parametric loss function, the authors lay out different combinations of model size and training data and estimate, for each FLOPs budget, the combination that sits on the efficient frontier.
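As a back-of-the-envelope illustration of what "scale them equally" works out to, here's a tiny calculator. It leans on the usual cost approximation FLOPs ≈ 6·N·D and on the tokens-per-parameter ratio implied by Chinchilla itself (1.4T / 70B = 20); the function and the exact ratio are my shorthand, not something lifted from the paper.

```python
def compute_optimal(flops_budget, tokens_per_param=20.0):
    """Rough compute-optimal sizing under FLOPs ≈ 6 * N * D and D ≈ 20 * N."""
    # Solve 6 * N * (tokens_per_param * N) = flops_budget for N.
    n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = compute_optimal(5.9e23)  # roughly the Gopher/Chinchilla training budget
print(f"~{n:.1e} parameters trained on ~{d:.1e} tokens")
# -> about 7e10 parameters and 1.4e12 tokens, i.e. the Chinchilla configuration
```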

Their conclusion is that the publicly available large language models (GPT-3 and the like) are considerably over-sized for the compute spent on them. Given the compute budget used for each of these models, a smaller model trained on more tokens would have achieved the same or better results. (This is part of why OpenAI and other LLM companies are so data-hungry: having more data available saves them tremendous amounts of compute.)

Chinchilla

Now we get to the good part: actually training the authors' target model. Given the same compute budget as their earlier model Gopher, they use their analysis to estimate an optimal model size for Chinchilla: 70B parameters trained on 1.4T tokens.
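Plugging those numbers into the same FLOPs ≈ 6·N·D approximation gives a feel for the budget involved:

$$ C \;\approx\; 6ND \;=\; 6 \times (70 \times 10^{9}) \times (1.4 \times 10^{12}) \;\approx\; 5.9 \times 10^{23}\ \text{FLOPs} $$

which is in the same ballpark as the compute used to train Gopher.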

The authors put the model through its paces on six language processing tasks:

  1. Language modeling (constructing human-understandable language), where it outperforms GPT-3 and Gopher
  2. Massive Multitask Language Understanding (exam-like questions on academic subjects), where it outperforms human experts on things like law, sociology, and government, but underperforms on mathematics and formal logic
  3. Reading Comprehension, where it outperforms Gopher
  4. BIG-bench, a grab bag of various language tasks, where it outperforms Gopher
  5. Common sense tasks, a grab bag of tasks testing the model's ability to do basic reasoning, where it outperforms Gopher
  6. Closed-book question answering, where it again outperforms Gopher

header image taken by Guérin Nicolas