How long is that road? How many steps beyond Perceiver are there? Two, five, 11, 29, 53, 378...? Obviously we haven't a clue. Still...
From the article:
Last year, at the International Solid State Circuits Conference, an annual technical symposium held in San Francisco, Google's Dean described in his keynote address one future direction of deep learning as the "goal of being able to train a model that can perform thousands or millions of tasks in a single model."
"Building a single machine learning system that can handle millions of tasks … is a true grand challenge in the field of artificial intelligence and computer systems engineering," said Dean.
In a conversation with ZDNet at the conference, Dean explained how a kind of super-model would build up from work over the years on neural networks that combine "modalities," different sorts of input such as text and image, and combinations of models known as "mixture of experts":
Mixture of experts-style approaches, I think, are going to be important, and multi-task, and multi-modal approaches, where you sort-of learn representations that are useful for many different things, and sort-of jointly learn good representations that help you be able to solve new tasks more quickly, and with less data, fewer examples of your task, because you are already leveraging all the things you already know about the world.
Perceiver is in the spirit of that multi-tasking approach. It takes in three kinds of inputs: images, videos, and what are called point clouds, a collection of dots that describes what a LiDAR sensor on top of a car "sees" of the road.
Once the system is trained, it can perform with some meaningful results on benchmark tests, including the classic ImageNet test of image recognition; Audio Set, a test developed at Google that requires a neural net to pick out kinds of audio clips from a video; and ModelNet, a test developed in 2015 at Princeton whereby a neural net must use 2,000 points in space to correctly identify an object.
If we look at the human brain, to a first approximation neocortical tissue is much the same everywhere: micro- and minicolumns perpendicular to the sheet, which has, I believe, six layers. But different neocortical areas have different subcortical connections, ultimately connecting with different sensory inputs and motor outputs. And neocortical areas also have varying patterns of connectivity with one another. So what we have is one basic architecture that supports a variety of specializations. Sounds like a "mixture of experts."
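For readers who haven't met the term, here is a toy sketch of what "mixture of experts" routing can look like. It is an illustration under my own assumptions, not anything taken from Google's systems: the function `route`, the three scalar "experts," and the hand-picked `gate_weights` are all hypothetical. A small gate scores each expert for a given input, and only the top-scoring experts do any work.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(x, experts, gate_weights, top_k=1):
    """Send input x to the top_k experts chosen by a linear gate.

    experts      : list of functions mapping a vector to a float (hypothetical)
    gate_weights : one weight row per expert (hand-picked here, learned in practice)
    """
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    probs = softmax(scores)
    # Keep only the top_k experts; the rest are never evaluated.
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    output = sum(probs[i] / norm * experts[i](x) for i in chosen)
    return output, chosen

# Three toy "specialists" sharing one input format.
experts = [lambda x: sum(x), lambda x: max(x), lambda x: min(x)]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

output, chosen = route([3.0, 1.0], experts, gate_weights, top_k=1)
print(chosen, output)  # [0] 4.0
```

The point of the analogy: the experts share an interface, the way cortical areas share an architecture, and the gate decides which specialist handles which input.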
There is much more at the link.
Check out the underlying research paper:
Abstract: Biological systems perceive the world by simultaneously processing high-dimensional inputs from modalities as diverse as vision, audition, touch, proprioception, etc. The perception models used in deep learning on the other hand are designed for individual modalities, often relying on domain-specific assumptions such as the local grid structures exploited by virtually all existing vision models. These priors introduce helpful inductive biases, but also lock models to individual modalities. In this paper we introduce the Perceiver - a model that builds upon Transformers and hence makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like ConvNets. The model leverages an asymmetric attention mechanism to iteratively distill inputs into a tight latent bottleneck, allowing it to scale to handle very large inputs. We show that this architecture is competitive with or outperforms strong, specialized models on classification tasks across various modalities: images, point clouds, audio, video, and video+audio. The Perceiver obtains performance comparable to ResNet-50 and ViT on ImageNet without 2D convolutions by directly attending to 50,000 pixels. It is also competitive in all modalities in AudioSet.
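The "asymmetric attention" the abstract mentions can be sketched roughly as follows. This is a bare-bones illustration under stated assumptions, not the paper's implementation: it leaves out the learned query, key, and value projections, multiple heads, and the repeated latent Transformer layers, and the name `cross_attend` and the sizes N, M, D are made up for the example. What it does show is the asymmetry: a small set of M latent vectors acts as queries over N input elements, so the cost grows with M × N rather than N × N, and the output has M rows no matter how large the input is.

```python
import math
import random

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(latents, inputs):
    """Each latent vector (query) attends over all inputs (keys and values).

    latents : M vectors of length D
    inputs  : N vectors of length D
    returns : M vectors of length D, regardless of N
    """
    d = len(latents[0])
    out = []
    for q in latents:
        # Scaled dot-product score of this query against every input element.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in inputs]
        weights = softmax(scores)
        # Weighted average of the inputs: the distilled summary for this latent.
        out.append([sum(w * v[j] for w, v in zip(weights, inputs)) for j in range(d)])
    return out

random.seed(0)
N, M, D = 1000, 8, 4  # many inputs, few latents (illustrative sizes)
inputs = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
latents = [[random.gauss(0, 1) for _ in range(D)] for _ in range(M)]

summary = cross_attend(latents, inputs)
print(len(summary), len(summary[0]))  # 8 4
```

Nothing in `cross_attend` cares whether the rows of `inputs` came from pixels, audio samples, or LiDAR points, which is the sense in which the model makes few assumptions about its inputs.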