Sunday, July 13, 2025

Do LLMs require statistical foundations?

Weijie Su, Do Large Language Models (Really) Need Statistical Foundations?, arXiv:2505.19145v2 [stat.ME], June 2, 2025.

Abstract: Large language models (LLMs) represent a new paradigm for processing unstructured data, with applications across an unprecedented range of domains. In this paper, we address, through two arguments, whether the development and application of LLMs would genuinely benefit from foundational contributions from the statistics discipline. First, we argue affirmatively, beginning with the observation that LLMs are inherently statistical models due to their profound data dependency and stochastic generation processes, where statistical insights are naturally essential for handling variability and uncertainty. Second, we argue that the persistent black-box nature of LLMs -- stemming from their immense scale, architectural complexity, and development practices often prioritizing empirical performance over theoretical interpretability -- renders closed-form or purely mechanistic analyses generally intractable, thereby necessitating statistical approaches due to their flexibility and often demonstrated effectiveness. To substantiate these arguments, the paper outlines several research areas -- including alignment, watermarking, uncertainty quantification, evaluation, and data mixture optimization -- where statistical methodologies are critically needed and are already beginning to make valuable contributions. We conclude with a discussion suggesting that statistical research concerning LLMs will likely form a diverse “mosaic” of specialized topics rather than deriving from a single unifying theory, and highlighting the importance of timely engagement by our statistics community in LLM research.

H/t Jessica Hullman:

Something a bit cringey that becomes clearer when you see the various statistical challenges laid out like this is that sometimes they arise not just because LLMs are too complex for us to understand, but also because they are proprietary objects. For example, once a large model has been trained, it is often more efficient to distill its knowledge into a smaller model than to train that smaller model from scratch. This motivates developers of big models to figure out ways to make their outputs resistant to distillation by competitors. It's all just statistics I suppose, but I'd much prefer to work on problems like uncertainty quantification or watermarking outputs than on how to resist sharing knowledge! Similarly, secrecy around training data curation can make it harder to theorize about dependencies between data mixtures and model capabilities.
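For concreteness, the simplest form of distillation trains the small "student" model to match the large "teacher" model's softened output distribution rather than the hard labels. Here is a minimal sketch of that soft-target loss in PyTorch; the tensor names and temperature are illustrative, not anything from Su's paper or the quoted commentary:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target knowledge distillation: train the student to match the
    teacher's temperature-softened output distribution via KL divergence."""
    # Soften both distributions with the same temperature before comparing.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2
```

Making outputs "resistant to distillation" then amounts to making a competitor's version of this objective less informative, which is a rather different research agenda from watermarking or uncertainty quantification.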

In reading this alongside other recent takes on the state of stats in ML, it’s interesting to me that despite a growing consensus that we need to develop interpretable models to make sense of LLMs, there still seems to be a contingent of ML researchers who dismiss further integration of classical stats. For example, Su cites evaluation of LLMs as a place where we need statistically grounded methods to avoid an evaluation crisis with similarities to the replication crisis in social science, where researchers game the evaluations they present (there are various reasons to worry about this, some of which we summarized here a few years ago). But others refer to attempts to incentivize more thorough reporting of uncertainty in ML evaluation as “a weird obsession with statistics.” What’s up with that, I wonder?
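As a concrete, very modest example of what more thorough uncertainty reporting could look like: attach an interval estimate to a benchmark score instead of comparing bare point estimates. A minimal sketch, assuming per-item 0/1 correctness scores and a simple bootstrap; the function name and the synthetic data are made up for illustration:

```python
import numpy as np

def accuracy_with_ci(correct, n_boot=10_000, alpha=0.05, seed=0):
    """Benchmark accuracy with a bootstrap confidence interval,
    rather than a bare point estimate."""
    correct = np.asarray(correct, dtype=float)  # 1 = item answered correctly, 0 = not
    rng = np.random.default_rng(seed)
    # Resample items with replacement and recompute accuracy each time.
    boot = rng.choice(correct, size=(n_boot, correct.size), replace=True).mean(axis=1)
    lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return correct.mean(), (lo, hi)

# On a 500-item benchmark, two models whose point accuracies differ by a
# percentage point may well have overlapping intervals.
scores = np.random.default_rng(1).integers(0, 2, size=500)
acc, (lo, hi) = accuracy_with_ci(scores)
print(f"accuracy = {acc:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```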
