Luo, X., Rechardt, A., Sun, G. et al. Large language models surpass human experts in predicting neuroscience results. Nat Hum Behav 9, 305–315 (2025). https://doi.org/10.1038/s41562-024-02046-9
Abstract: Scientific discoveries often hinge on synthesizing decades of research, a task that potentially outstrips human information processing capacities. Large language models (LLMs) offer a solution. LLMs trained on the vast scientific literature could potentially integrate noisy yet interrelated findings to forecast novel results better than human experts. Here, to evaluate this possibility, we created BrainBench, a forward-looking benchmark for predicting neuroscience results. We find that LLMs surpass experts in predicting experimental outcomes. BrainGPT, an LLM we tuned on the neuroscience literature, performed better yet. Like human experts, when LLMs indicated high confidence in their predictions, their responses were more likely to be correct, which presages a future where LLMs assist humans in making discoveries. Our approach is not neuroscience specific and is transferable to other knowledge-intensive endeavours.
I find this interesting and heartening because I work across a wide range of areas and disciplines – cognition, language, culture, neuroscience, AI – and like to integrate over that range. Obviously not over the entire range at once, but over two or three, perhaps even four, in a single complex document. I've been having LLMs comment on some of my more ambitious papers, which has been illuminating. I'm also using LLMs, mostly Claude, to develop ideas across a range of disciplines. Two examples where I publish an interaction with Claude:
See the spider graph on the right-hand side of Fig. S.19 showing... BrainBench has humans covered. 1,000+ words, one image.
"Figure S.19: LLMs outperformed human experts on BrainBench (using GPT-4 created test cases). Base versions of models outperformed chat and instruct versions, which were tuned to be conversational with humans."