Dohare, S., Hernandez-Garcia, J.F., Lan, Q. et al. Loss of plasticity in deep continual learning. Nature 632, 768–774 (2024). https://doi.org/10.1038/s41586-024-07711-7
Abstract: Artificial neural networks, deep-learning methods and the backpropagation algorithm form the foundation of modern machine learning and artificial intelligence. These methods are almost always used in two phases, one in which the weights of the network are updated and one in which the weights are held constant while the network is used or evaluated. This contrasts with natural learning and many applications, which require continual learning. It has been unclear whether or not deep learning methods work in continual learning settings. Here we show that they do not—that standard deep-learning methods gradually lose plasticity in continual-learning settings until they learn no better than a shallow network. We show such loss of plasticity using the classic ImageNet dataset and reinforcement-learning problems across a wide range of variations in the network and the learning algorithm. Plasticity is maintained indefinitely only by algorithms that continually inject diversity into the network, such as our continual backpropagation algorithm, a variation of backpropagation in which a small fraction of less-used units are continually and randomly reinitialized. Our results indicate that methods based on gradient descent are not enough—that sustained deep learning requires a random, non-gradient component to maintain variability and plasticity.
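To make the reinitialization idea concrete, here is a minimal sketch of the selective-reset step the abstract describes: after ordinary gradient updates, a small fraction of the least-used hidden units get fresh random incoming weights. The utility proxy (mean absolute activation weighted by outgoing-weight magnitude), the function name and the hyperparameters are my illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def reinit_low_utility_units(W_in, W_out, activations, replace_fraction=0.001):
    """Reinitialize the lowest-utility hidden units of one hidden layer.

    W_in:        (n_inputs, n_hidden) incoming weights
    W_out:       (n_hidden, n_outputs) outgoing weights
    activations: (batch, n_hidden) recent hidden activations
    """
    n_hidden = W_in.shape[1]
    # Crude utility proxy (an assumption): how much each unit contributes downstream.
    utility = np.abs(activations).mean(axis=0) * np.abs(W_out).sum(axis=1)
    n_replace = max(1, int(replace_fraction * n_hidden))
    low = np.argsort(utility)[:n_replace]
    # Fresh random incoming weights for the selected units; zero their outgoing
    # weights so the reset does not immediately disturb the network's outputs.
    W_in[:, low] = rng.normal(0.0, 1.0 / np.sqrt(W_in.shape[0]),
                              size=(W_in.shape[0], n_replace))
    W_out[low, :] = 0.0
    return low
```

In a continual-learning loop this step would be interleaved with backpropagation updates, so a trickle of units is always being returned to a random, untrained state.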
That seems reminiscent of an idea that William Powers had years ago in Behavior: The Control of Perception, pp. 179 ff. These passages from the book may provide some intuition (pay attention to the Pask example):
One example of a self-reorganizing system was Ashby’s (1952) homeostat, a collection of four simple feedback control systems which also contained a separate “uniselector” capable of altering the system’s behavioral organization until a specific “survival” condition was satisfied. The homeostat could survive something that no computer program, however adaptive, could survive—an attack with a pair of wire-cutters. If one operational connection was destroyed, the uniselector could substitute another one. The uniselector itself produced no behavior; it acted to alter the physical connections in the behaving system.
Gordon Pask (1960) also built a device demonstrating physical reorganization. His device was a tray of iron-salt solution in which electrically conductive crystals could grow when direct current was applied to electrodes in the solution. These crystals would grow so as to complete connections between input and output terminals. Pask “rewarded” the tray of solution for making a desired connection by giving it some D.C. current, and “punished” it by withholding current, allowing the acid solution to dissolve the crystals. In this way he “trained” the solution tray to react in some absolutely astonishing ways. For example, he discovered that the network of crystal threads could be trained to discriminate between vibrations caused by two audible tones of different pitch!
This is what I mean by reorganization—not a change in the way existing components of a system are employed under control of recorded information, but a change in the properties or even the number of components. This category of learning is clearly the most fundamental, for it affects the kind of information that will be perceived and the kinds of computing elements available for use in programming.
See my post, Consciousness, reorganization and polyviscosity, Part 1: The link to Powers.