Sunday, March 10, 2024

A.I. and biology [search and smarts]

Carl Zimmer, A.I. Is Learning What It Means to Be Alive, NYTimes, Mar. 10, 2024. Here's how the article opens:

In 1889, a French doctor named Francois-Gilbert Viault climbed down from a mountain in the Andes, drew blood from his arm and inspected it under a microscope. Dr. Viault’s red blood cells, which ferry oxygen, had surged 42 percent. He had discovered a mysterious power of the human body: When it needs more of these crucial cells, it can make them on demand.

In the early 1900s, scientists theorized that a hormone was the cause. They called the theoretical hormone erythropoietin, or “red maker” in Greek. Seven decades later, researchers found actual erythropoietin after filtering 670 gallons of urine.

And about 50 years after that, biologists in Israel announced they had found a rare kidney cell that makes the hormone when oxygen drops too low. It’s called the Norn cell, named after the Norse deities who were believed to control human fate.

It took humans 134 years to discover Norn cells. Last summer, computers in California discovered them on their own in just six weeks.

The discovery came about when researchers at Stanford programmed the computers to teach themselves biology. The computers ran an artificial intelligence program similar to ChatGPT, the popular bot that became fluent with language after training on billions of pieces of text from the internet. But the Stanford researchers trained their computers on raw data about millions of real cells and their chemical and genetic makeup.

The researchers did not tell the computers what these measurements meant. They did not explain that different kinds of cells have different biochemical profiles. They did not define which cells catch light in our eyes, for example, or which ones make antibodies.

The computers crunched the data on their own, creating a model of all the cells based on their similarity to each other in a vast, multidimensional space. When the machines were done, they had learned an astonishing amount. They could classify a cell they had never seen before as one of over 1,000 different types. One of those was the Norn cell.

“That’s remarkable, because nobody ever told the model that a Norn cell exists in the kidney,” said Jure Leskovec, a computer scientist at Stanford who trained the computers.

The software is one of several new A.I.-powered programs, known as foundation models, that are setting their sights on the fundamentals of biology. The models are not simply tidying up the information that biologists are collecting. They are making discoveries about how genes work and how cells develop.

I'm of two minds about that. On the one hand: Remarkable! Yes, remarkable. OTOH: WTF? Whatever that computer (with its foundation model) did, it was working from a very large body of information, much of which was not available to Viault in 1889, nor to scientists in the early 1900s. Where did that knowledge come from? From humans, that's where. We accumulated that knowledge, fed it into a computer, and the computer used it to discover something we already knew. What's so remarkable about that?

That the computer was able to do that in only 6 weeks, yes, I suppose it's remarkable. But I'm not sure just why or how. And on the whole I object to the framing: Dumb humans took 134 years while smart computer took only six weeks.

Yes, Zimmer wanted to get our attention. OK, he's got it. Now what's he do with it? As far as I can tell at this point in the article – I've not read much farther – there are two things going on in that computer: 1) search through lots of stuff, and 2) recognition of something interesting. Both are important, necessary. Search just takes time and patience, lots of it. Recognition of something interesting, that takes smarts, intelligence if you will.

Going back to the original example, the discovery of Norn cells, how do we apportion those 134 years between smarts and search? Viault's recognition that the surge in red blood cells was important, that took smarts. But why did he draw his blood and examine it under a microscope? That took smarts as well. Theorizing that a hormone is involved, more smarts. Filtering 670 gallons of urine, search, albeit very crude search. Zimmer doesn't tell us what those Israeli biologists did. I'd guess a combination of search and smarts. I'm sure that computers can help enormously with search; search routines make up a big chunk of practical programming know-how and are of theoretical interest as well. Increasingly, computers have got smarts too. But search and smarts are two different things. (Wolfram talks about search and smarts in the article I just excerpted, though not quite in those terms.)

Let's return to Zimmer's article, picking up where we left off:

The software is one of several new A.I.-powered programs, known as foundation models, that are setting their sights on the fundamentals of biology. The models are not simply tidying up the information that biologists are collecting. They are making discoveries about how genes work and how cells develop.

As the models scale up, with ever more laboratory data and computing power, scientists predict that they will start making more profound discoveries. They may reveal secrets about cancer and other diseases. They may figure out recipes for turning one kind of cell into another.

“A vital discovery about biology that otherwise would not have been made by the biologists — I think we’re going to see that at some point,” said Dr. Eric Topol, the director of the Scripps Research Translational Institute.

Just how far they will go is a matter of debate. While some skeptics think the models are going to hit a wall, more optimistic scientists believe that foundation models will even tackle the biggest biological question of them all: What separates life from nonlife?

Debate, yes, that's good. As for attacking the difference between life and nonlife, are these foundation models going to do that on their own, or are they going to be guided by humans? Are we talking about genius-level computational creativity or the helpfulness of a really good assistant? Perhaps we're talking about something that's neither but has aspects of both, something we don't yet understand, though we're working on it.

Skipping over some stuff, Zimmer tells us something about that foundation model at Stanford:

The Stanford team got into the foundation-model business after helping to build one of the biggest databases of cells in the world, known as CellXGene. Beginning in August, the researchers trained their computers on the 33 million cells in the database, focusing on a type of genetic information called messenger RNA. They also fed the model the three-dimensional structures of proteins, which are the products of genes.

From this data, the model — known as Universal Cell Embedding, or U.C.E. — calculated the similarity among cells, grouping them into more than 1,000 clusters according to how they used their genes. The clusters corresponded to types of cells discovered by generations of biologists.

U.C.E. also taught itself some important things about how the cells develop from a single fertilized egg. For example, U.C.E. recognized that all the cells in the body can be grouped according to which of three layers they came from in the early embryo.

“It essentially rediscovered developmental biology,” said Stephen Quake, a biophysicist at Stanford who helped develop U.C.E.

The model was also able to transfer its knowledge to new species. Presented with the genetic profile of cells from an animal that it had never seen before — a naked mole rat, say — U.C.E. could identify many of its cell types.

“You can bring a completely new organism — chicken, frog, fish, whatever — you can put it in, and you will get something useful out,” Dr. Leskovec said.

We're getting somewhere. Still... The article marches on. Interesting, every bit of it. But the framing, the framing... Zimmer's the one who wrote the article, so he's ultimately responsible for the framing. But he can only work with what the experts give him. And they don't know quite what their remarkable machines are doing. We've got a lot to learn. Zimmer goes on: "Just like ChatGPT, biological models sometimes get things wrong." Ah, yes, perhaps not magic. And then:

Scientists are also developing tools that let foundation models combine what they’re learning on their own with what flesh-and-blood biologists have already discovered. The idea would be to connect the findings in thousands of published scientific papers to the databases of cell measurements.

The risks:

If foundation models live up to Dr. Quake’s dreams, they will also raise a number of new risks. On Friday, more than 80 biologists and A.I. experts signed a call for the technology to be regulated so that it cannot be used to create new biological weapons. Such a concern might apply to new kinds of cells produced by the models.

Read the whole thing. And here's the abstract of an article about that Stanford foundation model:

Yanay Rosen, Yusuf Roohani, Ayush Agarwal, Leon Samotorčan, Tabula Sapiens Consortium, Stephen R. Quake, Jure Leskovec bioRxiv 2023.11.28.568918; doi: https://doi.org/10.1101/2023.11.28.568918

Abstract: Developing a universal representation of cells which encompasses the tremendous molecular diversity of cell types within the human body and more generally, across species, would be transformative for cell biology. Recent work using single-cell transcriptomic approaches to create molecular definitions of cell types in the form of cell atlases has provided the necessary data for such an endeavor. Here, we present the Universal Cell Embedding (UCE) foundation model. UCE was trained on a corpus of cell atlas data from human and other species in a completely self-supervised way without any data annotations. UCE offers a unified biological latent space that can represent any cell, regardless of tissue or species. This universal cell embedding captures important biological variation despite the presence of experimental noise across diverse datasets. An important aspect of UCE’s universality is that any new cell from any organism can be mapped to this embedding space with no additional data labeling, model training or fine-tuning. We applied UCE to create the Integrated Mega-scale Atlas, embedding 36 million cells, with more than 1,000 uniquely named cell types, from hundreds of experiments, dozens of tissues and eight species. We uncovered new insights about the organization of cell types and tissues within this universal cell embedding space, and leveraged it to infer function of newly discovered cell types. UCE’s embedding space exhibits emergent behavior, uncovering new biology that it was never explicitly trained for, such as identifying developmental lineages and embedding data from novel species not included in the training set. Overall, by enabling a universal representation for every cell state and type, UCE provides a valuable tool for analysis, annotation and hypothesis generation as the scale and diversity of single cell datasets continues to grow.
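A toy illustration of the "similarity in a multidimensional space" idea may help here. This is only a sketch under loud assumptions, not the Stanford team's method: the three-number "embeddings" are made up, the cell-type labels are attached by hand for illustration (the abstract says UCE itself was trained without annotations), and the function names `cosine`, `centroid`, and `nearest_type` are hypothetical. It shows the general technique of placing a new cell next to the group it most resembles, here by cosine similarity to each group's average vector.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Component-wise average of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def nearest_type(cell, centroids):
    """Name of the group whose centroid is most similar to the cell."""
    return max(centroids, key=lambda name: cosine(cell, centroids[name]))

# Made-up embeddings (assumption): real ones would have far more dimensions
# and would come out of a trained model, not be typed in by hand.
known = {
    "red blood cell": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "photoreceptor": [[0.1, 0.9, 0.1], [0.0, 0.8, 0.2]],
    "Norn-like cell": [[0.1, 0.1, 0.9], [0.2, 0.0, 0.8]],
}
centroids = {name: centroid(vecs) for name, vecs in known.items()}

# A cell the toy "atlas" has not seen before.
new_cell = [0.15, 0.05, 0.85]
print(nearest_type(new_cell, centroids))
```

Notice where the work is in this sketch. Comparing the new cell against every group is search, and it's trivial. Whatever smarts there are live in how the embeddings get made, which is exactly the part the toy skips.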
