Artem Kaznatcheev, Konrad Paul Kording, Nothing makes sense in deep learning, except in the light of evolution, arXiv:2205.10320
This is an interesting and imaginative article. I am particularly pleased that they regard the cultural object, in this case DL models, as the beneficiary of cultural evolution and not the human creators of the models. I believe this is the correct approach, and it seems to be what Dawkins had in mind when he first advanced the idea of memes in The Selfish Gene (1976), though memetics has not developed well as an intellectual discipline.[1] I have included the article's abstract at the end of these notes.
I want to take up two issues:
- randomness, and
- identifying roles in the evolutionary process.
Randomness
From the paper, p. 3:
As we consider the “arrival of the fittest”, the history of deep learning might seem quite different from biological evolution in one particular way: new mutations in biology are random but new ideas in deep learning do not seem to be random.
What matters, though, is which ideas become embedded in practice and survive over the long term, for whatever value of “long” is appropriate, which is not at all obvious.
Consider a case I know better, that of music. To a first approximation no one releases a song to the marketplace with the expectation that it will fail to find an audience. Rather, they intend to reach an audience and craft the song with that intention. Audiences do not, however, care about the artist’s intentions, nor the intentions of their financial backers. They care only about the music they hear. Whether or not a song will be liked, much less whether or not it will become a hit, cannot be predicted.
A similar case exists with movies. The business is notoriously fickle, but producers do everything in their power to release films that will return a profit. This has been studied by Arthur De Vany in Hollywood Economics (2004).[2] By the time a film is released we know the producer, director, screenwriter, principal actors, and their records. None of those things, taken individually or collectively, allows us to predict how a film will perform at the box office. De Vany shows that at about three or four weeks into circulation, the trajectory of a movie’s dynamics (that is, of people coming to theaters to watch it) hits a bifurcation. Most movies enter a trajectory that leads to diminishing attendance and no profits. A few enter a trajectory that leads to continuing attendance and, eventually, a profit. Among these, a very few become blockbusters. We cannot predict the trajectory of an individual movie in advance.
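To make that bifurcation concrete, here is a minimal toy simulation in Python. It is my own illustration, not De Vany’s actual model: attendance is just a multiplicative word-of-mouth random walk, with all parameters invented for the sketch. Most simulated runs decay toward nothing while a small fraction keep growing, which is the qualitative shape of the story.

```python
import random

def weekly_attendance(weeks=12, rng=None):
    """Toy multiplicative word-of-mouth model (NOT De Vany's model).

    Each week's attendance is last week's attendance times a noisy
    word-of-mouth multiplier. The multiplier's median is below 1, so most
    runs decay toward zero, but the heavy upper tail lets a small fraction
    keep growing -- a crude stand-in for the week-three-or-four bifurcation.
    """
    rng = rng or random.Random()
    attendance = [1.0]  # normalized opening-week audience
    for _ in range(weeks - 1):
        multiplier = rng.lognormvariate(-0.1, 0.6)  # invented parameters
        attendance.append(attendance[-1] * multiplier)
    return attendance

if __name__ == "__main__":
    rng = random.Random(42)
    finals = sorted(weekly_attendance(rng=rng)[-1] for _ in range(10_000))
    print("median final attendance   :", round(finals[len(finals) // 2], 3))
    print("share below 10% of opening:", sum(f < 0.1 for f in finals) / len(finals))
    print("largest final attendance  :", round(finals[-1], 1))
```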
Few objects are more deliberately crafted than movies. All that deliberation is insufficient to predict audience response. Films are too complex to allow that.
Thus I am, in principle, skeptical of Kaznatcheev’s and Kording’s claim that the evolution of DL models is not random in the way that biological evolution is. Yes, developers act in a deliberate and systematic way, but it is not at all clear to me how closely coupled those intentions are to the overall development of the field. What if, for example, the critics of deep learning, such as Gary Marcus, are proven correct at some time in the future? What happens to these models then? Do they disappear from use entirely, indicating evolutionary failure? Or perhaps they continue, but in the context of a more elaborate and sophisticated system – perhaps analogous to the evolution of eukaryotic cells from the symbiosis of simpler types of cells. That of course counts as evolutionary success.
Closer to home, the performance of DL models seems somewhat unpredictable. For example, it is my impression that the performance of GPT-3 surprised everyone, including the people who created it. Other models have had unexpected outcomes as well. I know nothing about the expectations DL researchers may have about how traits included in a new architecture are going to affect performance metrics. But I would be surprised if very precise prediction is possible.
I don’t regard these considerations as definitive. But I do think they are reason to be very careful about claims made on the basis of developer intentions. Further investigation is needed.
Roles in the evolutionary process
It is my understanding that biological evolution involves a number of roles:
- the environment in which an organism must live and survive,
- the phenotypic traits the organism presents to that environment,
- the genetic elements that pass from one generation to the next, and
- the developmental process that leads from genetic elements to mature phenotypes.
How do Kaznatcheev and Kording assign aspects of deep learning development to parallel roles?
They explicitly assert, p. 6:
In computer science, we will consider a general specification of a model or algorithm as the scientist-facing description – usually as pseudocode or text. And we will use ‘development’ to mean every process downstream of the general specification. For a clear example – all processes during compilation or runtime would be under ‘development’. We might even consider as ‘development’ the human process of transforming pseudocode in a paper into a programming language code.
That, roughly speaking, is the development process.
Am I to take it then that the genetic elements are to be found in “the scientist-facing description – usually as pseudocode or text”? I don’t know. But let me be clear, I am asking out of open curiosity, not out of a desire to find fault. They know the development process far better than I do. Given what they’ve said, that scientist-facing description seems to be analogous to an organism’s genome.
Correlatively, the mature phenotype would be the code that executes the learning process. Do we think of the data on which the process is executed as part of the phenotype as well? If so, interesting, very interesting.
That leaves us with the environment in which the DL model must function. I take that to be both the range of specific metrics to which the model is subjected and the range of open-ended commentary directed toward it. Here’s a question: How is performance on specific metrics traced back to specific ‘phenotypic’ traits?
Consider a different and, it seems to me, more tractable example: automobiles. One common measure of performance is acceleration, say, from zero to 60 mph. We’ve got a particular car and we want to improve its acceleration. What do we do? There is of course an enormous body of information, wisdom, and lore on this kind of thing. There are things we can do to specific automobiles once they’ve been manufactured, but there are also things we can do to redesign the car.
Where do we focus our attention? On the cylinder bore and stroke? The electrical system? The transmission? Axle, wheels, and tires? Lighter, but more expensive, materials? Perhaps we make the shape more aerodynamic? Why not all of the above?
So, we do all of the above and our new car now does 0-60 in four seconds, while the old one did it in 5.5. How do we attribute the improvement across all the differences between the new and the old models? If we can’t do that with a fair amount of accuracy, then how are we to know which design changes were important and which were not? If we don’t know that, then how do we determine which traits to keep in play in further development?
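One standard way to get at that attribution question is an ablation study: re-test variants of the new design that each remove a single change and see how much performance degrades. The sketch below is purely hypothetical; the component names and 0-60 times are invented for illustration, and it also shows why this is not a complete answer.

```python
# Hypothetical ablation sketch: every component name and timing below is
# invented for illustration; this is not data about any real car (or model).

baseline_time = 5.5   # 0-60 mph time of the old design, in seconds
redesign_time = 4.0   # 0-60 mph time with every change applied

# Measured (here: invented) 0-60 times with exactly one change removed.
ablations = {
    "turbocharger":     4.8,  # redesign minus the turbo
    "lighter_body":     4.3,  # redesign minus the lighter materials
    "new_transmission": 4.1,  # redesign minus the new transmission
}

print(f"total improvement: {baseline_time - redesign_time:.1f} s")
for component, time_without in ablations.items():
    marginal = time_without - redesign_time  # cost of removing this one change
    print(f"removing {component:<16} costs {marginal:.1f} s")

# The marginal effects (0.8 + 0.3 + 0.1 = 1.2 s) need not add up to the total
# improvement (1.5 s): components interact, which is exactly why attributing
# credit across many simultaneous changes is hard.
```

DL development faces the same issue, which is one reason ablation studies are so common in DL papers: they are a partial, imperfect answer to the trait-attribution problem.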
What does this imply about the role of deliberate designer intention in the evolutionary process of complex technical artifacts?
* * * * *
Finally, I note that Kaznatcheev and Kording devote a major section of the article to considerations derived from EvoDevo. I have been aware of EvoDevo for years, but know little about it. So this (kind of) material is new to me.
I like what they’re doing with it. They make the point that organisms, and complex technical assemblages, have an internal coherence and dynamic that constrains how they can be modified successfully. Changes must be consistent with existing structures and mechanisms. That does enforce order on the evolutionary process.
Abstract of the Article
Deep Learning (DL) is a surprisingly successful branch of machine learning. The success of DL is usually explained by focusing analysis on a particular recent algorithm and its traits. Instead, we propose that an explanation of the success of DL must look at the population of all algorithms in the field and how they have evolved over time. We argue that cultural evolution is a useful framework to explain the success of DL. In analogy to biology, we use ‘development’ to mean the process converting the pseudocode or text description of an algorithm into a fully trained model. This includes writing the programming code, compiling and running the program, and training the model. If all parts of the process don't align well then the resultant model will be useless (if the code runs at all!). This is a constraint. A core component of evolutionary developmental biology is the concept of deconstraints – these are modifications to the developmental process that avoid complete failure by automatically accommodating changes in other components. We suggest that many important innovations in DL, from neural networks themselves to hyperparameter optimization and AutoGrad, can be seen as developmental deconstraints. These deconstraints can be very helpful to both the particular algorithm in how it handles challenges in implementation and the overall field of DL in how easy it is for new ideas to be generated. We highlight how our perspective can both advance DL and lead to new insights for evolutionary biology.
References
[1] I have prepared a brief sketch laying out various approaches that are being taken to study cultural evolution: A quick guide to cultural evolution for humanists, Working Paper, November 14, 2019, 4 pp., https://www.academia.edu/40930224/A_quick_guide_to_cultural_evolution_for_humanists
[2] Arthur De Vany, Hollywood Economics: How Extreme Uncertainty Shapes the Film Industry, Routledge, 2004. I’ve written a brief review: Chaos in the Movie Biz: A Review of Hollywood Economics, New Savanna, December 9, 2018, https://new-savanna.blogspot.com/2012/05/chaos-in-movie-biz-review-of-hollywood.html
Thank you for these comments, Bill!
Just a few small notes:
Our comments on pg. 3 about randomness are meant to be the words of a potential critic, not our own. We will make that clearer. The point of bringing in EvoDevo is to explain how apparent non-randomness at the level of phenotype could be due to randomness at the level of genotype.
You raise an interesting further complication at the level of phenotype: even if the phenotype is non-random, as with movies, the resultant fitness can be highly unpredictable. This is a very deep point and speaks to the complexity of the environment. The phenotype of the movie is, however, non-random, and we can notice that by considering the movies that AREN'T made (i.e., the ones that are completely non-viable). But it is really interesting that among the viable movies (i.e., those that see theatres), the actual fitness value is hard to predict. I suspect something similar is true to some extent in deep learning, but Konrad and I will have to look at the data to know for sure.
You ask what the heritable material is. We are not sure at this point, but we do expect it to be somewhere below the level of the scientist-facing description (and that is why we start the development map there). As we note early in the article: "Of course, this is a decisively qualitative account of why DL algorithms are successful. Just like the early naturalists in biology, who did not know the DNA-basis of heredity, we do not yet know how exactly the three hallmarks of evolution are implemented in the DL field. It is not as simple as the code base or the manuscript text, and requires future work to identify the hereditary basis – i.e., the “genes” or “memes” – for deep learning algorithms. The quantitative tools of modern evolutionary biology could help us answer these questions in the future."
I will have to think about the rest of the questions more deeply, but I hope to be able to get back with more answers.
Thanks for your response, Artem.
The thing about randomness is THE standard objection to the idea of cultural evolution. It is sometimes followed by the assertion that, yes, culture evolves, but it is Lamarckian, not Darwinian.
Before going on I should tell/warn you that, while I have fairly sophisticated mathematical intuitions, I have little formal training and technical skill. My intuitions come from spending a lot of time interacting with people who have skills that I do not. And I like to look at diagrams, and to draw them.
Let’s set randomness aside, as I’m not sure what it is. But I’m OK with unpredictable. I take it to mean something like this: We have some system where we know the laws of the system and its state at T1, but we have no way of calculating its state at some arbitrarily chosen T2 (which could be before or after T1, though we’re generally looking forward). The best we can do is simulate the evolution of the system over time at whatever resolution our computational resources can support. Alas, we’re often dealing with systems where the laws are obscure and states are difficult to determine.
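A minimal sketch of what I mean, using the logistic map as a stand-in system (the choice of system and the numbers are mine, purely for illustration): the law is a one-line equation and the state is a single number, yet two starting states that agree to ten decimal places have diverged completely well before T2, so the only way to get the state at T2 is to step the simulation forward.

```python
def logistic_step(x, r=4.0):
    """One step of the logistic map x -> r * x * (1 - x): a system whose
    'law' is a one-line equation but whose future states are effectively
    unpredictable without stepping the simulation forward."""
    return r * x * (1 - x)

def trajectory(x0, steps):
    """Simulate the system forward from state x0 for the given number of steps."""
    xs = [x0]
    for _ in range(steps):
        xs.append(logistic_step(xs[-1]))
    return xs

# Two states at T1 that agree to ten decimal places...
a = trajectory(0.2000000000, 60)
b = trajectory(0.2000000001, 60)

# ...are completely decorrelated well before T2 = 60 steps later.
for t in (0, 10, 30, 60):
    print(f"t={t:2d}  a={a[t]:.6f}  b={b[t]:.6f}  |a-b|={abs(a[t]-b[t]):.6f}")
```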
Here’s a highly idealized account of how the movie business works. 1) Movies start out as ideas in the minds of screenwriters and initially take the form of “pitches,” a term of art. The writers will deliver their pitches to various intermediaries, agents or executives. 2) One pitch in ten will make it to the next stage, where a “treatment” – another term of art – is prepared. Treatments generally run between 3K and 10K words and spell out the story; but they are not full scripts. 3) One treatment in ten will result in a commissioned script. 4) One script in ten will go into production. 5) One production in ten will result in a movie that is released to the public. Moreover, the scenes and dialog on the screen aren’t necessarily the ones in the script that started the production process. The differences may be small or major. The whole process is known as, wouldn’t you know, development.
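Taking those idealized one-in-ten odds at face value (they are round numbers, not measured rates), the survival fraction compounds quickly:

```python
# Back-of-the-envelope compounding of the idealized one-in-ten odds above.
stages = ["pitch -> treatment", "treatment -> script",
          "script -> production", "production -> release"]

survival = 1.0
for stage in stages:
    survival *= 0.1
    print(f"after {stage:<22} {survival:.4%} of pitches remain")

# 0.0100% survival: on these idealized numbers, roughly one pitch in ten
# thousand ends up as a movie released to the public.
```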
Which idea will make it from one step in the development pipeline to the next is uncertain. Moreover, what happens to a film when it is released, that too is uncertain. But here, at long last, we have numbers, and Arthur De Vany has created models. If you haven’t read his Hollywood Economics, you should.
We know that most movies don’t break even on their investment, much less make a profit. But a few will make a profit and, among those, a very few will become so-called blockbusters.
Where do we find viability in this story? Is a movie that fails to break even and ceases to be available in theaters viable? Walt Disney’s Fantasia was a flop when it was released in 1941. But Disney kept re-releasing it, so that by 1969 it finally began earning a profit. It has been available in home media ever since and has occasionally been released in theaters.
Consider novels. We don’t know how many were published in the 18th and 19th centuries, but most of them have been forgotten. A small handful, so-called classics, are still taught in schools and read outside of institutional contexts – a few of them are even made into movies. A somewhat larger, but still small, group is studied by professional literary critics. But the rest, dead and gone.
But in the last two decades or so many of them have been digitized. So scholars have been using computational techniques to study what’s in all those forgotten books. The results are often interesting. Alas, almost none of those scholars is interested in taking an evolutionary view of these texts.
I could go on and on, but I won’t. Let me conclude by saying I like your work very much. As you note, a lot is available in public records, so we’ve got a lot of material to examine.