The paper sought to answer this question: for a given FLOP budget, what is the optimal trade-off between parameters and dataset size? pic.twitter.com/9RiEFHK6OT
— Matthew Barnett (@MatthewJBar) March 31, 2022
However, in the DeepMind paper, they found that OpenAI's conclusion was premature. The main reason is that Kaplan et al. had used a fixed learning rate schedule for all their training runs, causing them to underestimate the returns from scaling data relative to parameters. pic.twitter.com/cphsej7vQU
— Matthew Barnett (@MatthewJBar) March 31, 2022
To test this result, they train a 70 billion parameter model — Chinchilla — that achieves state-of-the-art performance on Hendrycks et al.'s Massive Multitask Language Understanding dataset, a hard NLP benchmark. It even outperformed forecaster expectations out to June 2023! pic.twitter.com/gbuQSQrVA9
— Matthew Barnett (@MatthewJBar) March 31, 2022
In conclusion: we are leaving the era in which it's beneficial for companies to rapidly scale up parameter counts much higher than the current level of ~500 billion. But the era of scaling up our datasets to optimally accommodate our compute budgets has just begun.
— Matthew Barnett (@MatthewJBar) March 31, 2022
Abstract from the linked article:
We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget. We find that current large language models are significantly undertrained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant. By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, we find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled. We test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4× more data. Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks. This also means that Chinchilla uses substantially less compute for fine-tuning and inference, greatly facilitating downstream usage. As a highlight, Chinchilla reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, greater than a 7% improvement over Gopher.
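The abstract's equal-scaling rule can be made concrete with a back-of-the-envelope calculation. The sketch below is a minimal illustration, not the paper's method: it assumes the standard C ≈ 6·N·D approximation for transformer training FLOPs and a roughly 20-tokens-per-parameter ratio often attributed to the Chinchilla fits; neither constant appears in the abstract above, and the function name is purely illustrative.

```python
# Minimal sketch of compute-optimal allocation, assuming:
#   - training FLOPs C ≈ 6 * N * D (a standard transformer approximation)
#   - compute-optimal token count D ≈ 20 * N (a heuristic commonly attributed
#     to the Chinchilla results; not stated in the abstract above)
import math

def compute_optimal_split(flop_budget: float, tokens_per_param: float = 20.0):
    """Return (params, tokens) that spend `flop_budget` under the assumptions above."""
    # C = 6 * N * D and D = k * N  =>  N = sqrt(C / (6 * k)), D = k * N
    n_params = math.sqrt(flop_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    # Roughly Chinchilla's budget: 6 * 70e9 params * 1.4e12 tokens ≈ 5.9e23 FLOPs
    budget = 5.9e23
    n, d = compute_optimal_split(budget)
    print(f"params ≈ {n:.2e}, tokens ≈ {d:.2e}")  # ≈ 7.0e10 params, 1.4e12 tokens
```

Under this split both parameters and tokens grow as the square root of the compute budget, which is exactly the "double the model size, double the training tokens" rule the abstract describes; plugging in approximately Chinchilla's budget recovers about 70B parameters trained on about 1.4T tokens.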