Thursday, May 23, 2024

Fast weights, the recent past, and transformers

Jimmy Ba, Geoffrey Hinton, Volodymyr Mnih, Joel Z. Leibo, Catalin Ionescu, Using Fast Weights to Attend to the Recent Past, arXiv:1610.06258v3 [stat.ML] 5 Dec 2016

Abstract: Until recently, research on artificial neural networks was largely restricted to systems with only two types of variable: Neural activities that represent the current or recent input and weights that learn to capture regularities among inputs, outputs and payoffs. There is no good reason for this restriction. Synapses have dynamics at many different time-scales and this suggests that artificial neural networks might benefit from variables that change slower than activities but much faster than the standard weights. These “fast weights” can be used to store temporary memories of the recent past and they provide a neurally plausible way of implementing the type of attention to the past that has recently proved very helpful in sequence-to-sequence models. By using fast weights we can avoid the need to store copies of neural activity patterns.
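
To make the mechanism concrete, here is a minimal sketch (not the authors' exact implementation) of the core update from Ba et al.: a fast weight matrix A decays while accumulating outer products of recent hidden states, and the next hidden state is then "settled" through a short inner loop that repeatedly applies A. The function names, the tanh nonlinearity, and the hyperparameter values are illustrative assumptions.

```python
import numpy as np

def update_fast_weights(A, h, lam=0.95, eta=0.5):
    """Fast-weight update from the paper: A(t) = lam * A(t-1) + eta * h(t) h(t)^T.
    lam is the decay rate, eta the fast learning rate (values here are placeholders)."""
    return lam * A + eta * np.outer(h, h)

def inner_loop(A, sustained_input, h_init, n_steps=1):
    """Settle the next hidden state with the fast weights:
    h_{s+1} = f(sustained_input + A h_s), where sustained_input stands for the
    ordinary slow-weight drive (W h(t) + C x(t+1)); f is assumed to be tanh here."""
    h = h_init
    for _ in range(n_steps):
        h = np.tanh(sustained_input + A @ h)
    return h
```

Because A is a sum of outer products of recent activity vectors, applying it to a query pattern effectively attends to how similar that pattern is to each recent hidden state, without storing the hidden states themselves.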

Imanol Schlag, Kazuki Irie, Jürgen Schmidhuber, Linear Transformers Are Secretly Fast Weight Programmers, arXiv:2102.11174v3 [cs.LG] 9 Jun 2021

Abstract: We show the formal equivalence of linearised self-attention mechanisms and fast weight controllers from the early ’90s, where a “slow” neural net learns by gradient descent to program the “fast weights” of another net through sequences of elementary programming instructions which are additive outer products of self-invented activation patterns (today called keys and values). Such Fast Weight Programmers (FWPs) learn to manipulate the contents of a finite memory and dynamically interact with it. We infer a memory capacity limitation of recent linearised softmax attention variants, and replace the purely additive outer products by a delta rule-like programming instruction, such that the FWP can more easily learn to correct the current mapping from keys to values. The FWP also learns to compute dynamically changing learning rates. We also propose a new kernel function to linearise attention which balances simplicity and effectiveness. We conduct experiments on synthetic retrieval problems as well as standard machine translation and language modelling tasks which demonstrate the benefits of our methods.
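
The sketch below contrasts the purely additive fast-weight update of linearised attention with the delta-rule-style update described in the abstract. It is an illustrative assumption, not the paper's code: the kernel map phi here is the elementwise elu(x)+1 used in earlier linear-attention work (the paper proposes its own DPFP kernel), the normalisation term is omitted, and all function names are hypothetical.

```python
import numpy as np

def phi(x):
    """Illustrative kernel feature map: elu(x) + 1, kept positive so that
    outer-product writes behave like associative memory entries."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def additive_step(W, k, v):
    """Linear attention as an additive FWP: W_t = W_{t-1} + v_t phi(k_t)^T."""
    return W + np.outer(v, phi(k))

def delta_rule_step(W, k, v, beta):
    """Delta-rule programming instruction: read the value currently stored
    under key k, then write only the correction, scaled by a learning rate
    beta in (0, 1) that the slow net can produce dynamically:
    W_t = W_{t-1} + beta * (v_t - W_{t-1} phi(k_t)) phi(k_t)^T."""
    k_feat = phi(k)
    v_old = W @ k_feat
    return W + beta * np.outer(v - v_old, k_feat)

def read(W, q):
    """Query the fast-weight memory: y_t = W_t phi(q_t)."""
    return W @ phi(q)
```

The additive step can only pile new associations on top of old ones, which is the memory-capacity limitation the paper points to; the delta-rule step lets the programmer overwrite a stale value for a reused key instead of merely adding to it.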
