“Is bigger better? And, should compute be the foundation of governing risk for generative AI?”

 

Speaking about the artificial intelligence (AI) models that exist today, Sam Altman, CEO of OpenAI, said: “GPT-4 is the dumbest model any of you will ever, ever have to use again, by a lot.” 

With AI companies like OpenAI, xAI, Google, Meta, and Anthropic pouring billions of dollars into development, the expectation is that AI models will keep improving.

Traditionally, making models larger and training them on more data has been thought to be the most effective way to improve them. ‘Just make it bigger’ seems to be the motto here. And so far? It works. Bigger models, on average, outperform smaller ones.

But “Is bigger better? And, should compute be the foundation of governing risk for generative AI?” These were the questions posed by the keynote speaker, Sara Hooker, an expert in machine learning and the leader of Cohere for AI, at the Absolutely Interdisciplinary AI conference held on May 29. ‘Compute’ refers to the computational resources needed to train and run an AI model.

This conference, hosted by the Schwartz Reisman Institute for Science and Technology at U of T, brought together researchers and students to discuss the challenges presented by AI as its usage grows more common.

With new technology like AI, we must be cognizant of the risks it poses. So far, the AI industry and policymakers have taken a default stance that the risk associated with larger models is higher than that of smaller ones. How does this policy need to change to better fit current AI models and their risks?

Current assumptions about risk

Hooker spoke about how both the US and the European Union have suggested regulation based on a compute threshold, so that “models trained above a certain threshold of compute [size] should be subject to more scrutiny.” Companies building these larger models would have to provide reports to the government, including information on how the models are trained.
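
For a sense of how such a threshold works in practice, here is a rough sketch. Its figures are assumptions that do not come from Hooker's talk: the 2023 US executive order on AI referred to 10^26 operations, the EU AI Act refers to 10^25 floating-point operations, and training compute is estimated with a common rule of thumb of about six operations per parameter per training token. The two model sizes are hypothetical.

```python
# Illustrative thresholds (assumptions, not figures quoted in the talk).
US_THRESHOLD_FLOPS = 1e26  # 2023 US executive order on AI
EU_THRESHOLD_FLOPS = 1e25  # EU AI Act


def estimate_training_flops(n_params, n_tokens):
    """Rough training compute: about 6 operations per parameter per token."""
    return 6.0 * n_params * n_tokens


def exceeds_threshold(n_params, n_tokens, threshold):
    """True if the estimated training compute is above the threshold."""
    return estimate_training_flops(n_params, n_tokens) > threshold


# Hypothetical model A: 70 billion parameters, 15 trillion training tokens.
print(exceeds_threshold(70e9, 15e12, EU_THRESHOLD_FLOPS))  # False
# Hypothetical model B: 1 trillion parameters, 20 trillion training tokens.
print(exceeds_threshold(1e12, 20e12, US_THRESHOLD_FLOPS))  # True
```

Under these assumptions, the rule looks only at the compute estimate. Nothing about what a model can actually do enters the check.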

AI companies have oriented themselves around the view that scaling is inevitable: models are going to keep getting bigger and will continue to be trained on more data, so effort must be directed towards responsible scaling and minimizing risk.

Risks associated with AI models include hallucinations: the tendency of AI models to cook up information that is not true. There are also privacy concerns, since the models may accidentally give out proprietary or private information. Models can be biased, amplifying stereotypes or producing hateful text. Finally, creators can lose control, with the models acting in ways the creators cannot reliably control.

When adding more parameters to make a model larger, it is hard to predict how its output is going to change — this means that there is more room for these risks to arise. This is why governments and AI companies seem convinced that the compute size of a model is a good predictor of the risk it poses. Is this true?

Alternatives to scaling

Bigger models perform better than smaller ones, but there are important caveats. Making a model bigger only slightly increases its performance now that models have already become so large. According to Hooker, “If you double [the compute size of your model], you get… a measly two percentage points [of performance].”
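
To see what that figure implies, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that every doubling of compute adds the same two percentage points. The quote does not claim the gain stays constant, so read the results as arithmetic, not a forecast.

```python
import math

# Percentage points gained per doubling of compute, taken from the quote above.
GAIN_PER_DOUBLING = 2.0


def gain_from_compute_multiplier(multiplier):
    """Points gained when compute is multiplied by `multiplier` (must be >= 1)."""
    if multiplier < 1:
        raise ValueError("multiplier must be at least 1")
    return GAIN_PER_DOUBLING * math.log2(multiplier)


def compute_multiplier_for_gain(points):
    """How many times more compute is needed to gain `points` percentage points."""
    return 2.0 ** (points / GAIN_PER_DOUBLING)


print(gain_from_compute_multiplier(2))     # 2.0
print(gain_from_compute_multiplier(1024))  # 20.0
print(compute_multiplier_for_gain(10))     # 32.0
```

Under this assumption, gaining 10 percentage points would take 32 times the compute, and even a thousandfold increase would add only 20 points.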

The diminishing returns have several causes, including exploding computational and energy costs, as well as limited high-quality training data. As OpenAI’s co-founder, Ilya Sutskever, put it at the NeurIPS conference in Vancouver in 2024, “We have but one internet.”

In fact, pruning a model, that is, scaling it down after it has been trained, often does not significantly affect its performance. What you end up with after pruning is a smaller model with the performance, and therefore the risk potential, of a larger one.
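
The article does not say which pruning method Hooker had in mind. As one illustration, the sketch below uses magnitude pruning, a common approach that zeroes out the weights closest to zero on the assumption that they contribute least. The function name and the toy weights are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude fraction set to zero.

    `sparsity` is the fraction of weights to remove, between 0.0 and 1.0.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0.0 and 1.0")
    n_prune = int(len(weights) * sparsity)
    # Rank weight positions from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]


# Toy example: remove half of four weights.
print(magnitude_prune([0.5, -0.01, 0.3, 0.02], 0.5))  # [0.5, 0.0, 0.3, 0.0]
```

Real pruning operates on millions or billions of weights and is usually followed by an evaluation step to confirm that performance has held up.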

The boost in performance you could get by increasing the compute size of a model has thus started to plateau. More and more, we see smaller models with abilities comparable to larger ones. These smaller models are getting ahead by using curated training data and specializing in certain tasks, such as coding. Because they are trained on data selected for a specific task, they can skip data that is irrelevant to it, which improves their performance and reliability in the chosen domain.

The traditional ‘bigger is better’ scaling rule has therefore weakened, and compute size can no longer serve as the sole predictor of performance. If smaller models can be as risky as larger ones, policies need to start accounting for this shift.

The present hard-coded size thresholds do not take these nuances into account. Different AI applications also require models of different sizes, but a smaller model is not automatically safer: risks like bias, hallucination, and privacy concerns persist.

Hooker concluded her presentation by saying that AI risks are “hard to predict, especially in the future, and policy should be informed by scientific evidence, and be transparent about what risks [are present].”