Tuesday, December 13, 2022

Language Models and future progress in biology

Sam Rodriques asks: Why is progress in biology so slow? He's particularly interested in research leading to therapeutic uses. He covers three topics: Speed (the slow pace of research), Knowledge, and Talent. This is most of the section on Knowledge:

The biomedical literature is vast and suffers from three problems: it does not lend itself to summarization in textbooks; it is unreliable by commission; and it is unreliable by omission. The first problem is simple: biology is too diverse. Every disease, every gene, every organism, and every cell type is its own grand challenge. The second problem is trickier — some things in the literature are simply wrong, made up by trainees or professors who were desperate to publish rather than perish. But it is the third problem that is really pernicious: many things in the literature are uninterpretable or misleading due to the omission of key details by the authors, intentional or otherwise. Authors may report a new, general strategy for targeting nanoparticles to cells expressing specific receptor proteins and show that it works for HER2 and EGFR, while declining to mention that it does not work for any of the 20 other receptors they tried. Other times, a lab may decline to mention that their method only works above 50% relative humidity, because they never realized that was an issue. The unreliability of the literature by commission means that fundamentally, if you want to know that an experiment really works, you have to try it yourself. The unreliability by omission means that if you want to understand the real limitations of a technique, you usually have to work on it for 6 months to figure out how it really works and why.

How much progress we can make in biology in the next hundred years will depend on the extent to which the language models are able to solve these problems. The primary questions here will be: what fraction of knowledge in the world can be generalized from knowledge already in the literature, and how valuable is literature synthesis when a large portion of it is incorrect? The problem of summarization will be solved by language models soon. Within a few years at most (and maybe in a few months), every lab will have immediate access to the world’s expert in all of biology, which will happily educate them about the state of the art and the relevant subtleties in whatever field they choose. Simply tell the language model exactly what you want to do, and it will summarize for you everything relevant that is known, thereby avoiding the “if only I had known this” problem. It seems to me that the unreliability of the literature, however, will mean that for the foreseeable future (essentially until we have fully parameterized lab automation), AIs will be better at suggesting experiments and interpreting results than they will be at drawing conclusions from existing literature.

There’s more at the link.

H/t Tyler Cowen.
