This strengthens my hypothesis that there is a large set of equivalent neural architectures of which the original transformer is just one sample. https://t.co/rW5eGrbSn0
— Richard Socher (@RichardSocher) March 14, 2025
We found a surprisingly simple alternative to normalization layers: the scaled tanh function (yes, we go back to the 80s). We call it Dynamic Tanh, or DyT. pic.twitter.com/0sZ44mbZHR
— Zhuang Liu (@liuzhuang1234) March 14, 2025
Therefore, we replace norm layers with the proposed Dynamic Tanh (DyT) layer, and it is really simple: pic.twitter.com/qWToAhEmWX
— Zhuang Liu (@liuzhuang1234) March 14, 2025
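The screenshot in the tweet shows the layer in a few lines of PyTorch. As a rough sketch of the idea as the paper describes it, DyT computes gamma * tanh(alpha * x) + beta, where alpha is a learnable scalar (initialized around 0.5) and gamma, beta are the usual per-channel affine parameters; the code below is a reading of that description, not the authors' exact implementation:

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Dynamic Tanh: element-wise tanh(alpha * x) with a learnable scalar alpha,
    followed by the usual per-channel affine (gamma, beta). Intended as a
    drop-in replacement for LayerNorm / RMSNorm in a transformer block."""

    def __init__(self, num_features: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), alpha_init))  # learnable scalar
        self.weight = nn.Parameter(torch.ones(num_features))     # gamma
        self.bias = nn.Parameter(torch.zeros(num_features))      # beta

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Squash activations with a scaled tanh, then apply the affine transform.
        return torch.tanh(self.alpha * x) * self.weight + self.bias
```

Unlike LayerNorm or RMSNorm, there is no reduction over the feature dimension here: the tanh is purely element-wise, which is what makes the layer so cheap.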
DyT is faster than RMSNorm (common in frontier LLMs) on H100s pic.twitter.com/i7zgVTYKLB
— Zhuang Liu (@liuzhuang1234) March 14, 2025
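The speed numbers in the linked image are the authors' H100 measurements. For a rough local sanity check, a microbenchmark along these lines, assuming the hypothetical DyT class from the sketch above, a CUDA GPU, and PyTorch 2.4+ for nn.RMSNorm, compares per-call latency of the two layers; it is not the paper's benchmark setup:

```python
import torch
import torch.nn as nn

dim, tokens = 4096, 8192
x = torch.randn(tokens, dim, device="cuda", dtype=torch.bfloat16)

rms = nn.RMSNorm(dim, device="cuda", dtype=torch.bfloat16)  # PyTorch >= 2.4
dyt = DyT(dim).to(device="cuda", dtype=torch.bfloat16)      # sketch class from above

@torch.inference_mode()
def time_layer(layer, iters=1000):
    # Warm up, then time with CUDA events so we measure GPU time, not Python overhead.
    for _ in range(10):
        layer(x)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        layer(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call

print(f"RMSNorm: {time_layer(rms):.4f} ms   DyT: {time_layer(dyt):.4f} ms")
```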
Normalization layers have always been one of the more mysterious aspects of deep learning for me, and this work has given me a better understanding of their role.
— Zhuang Liu (@liuzhuang1234) March 14, 2025
Given that model training and inference can require tens of millions of dollars in compute, DyT has the potential to contribute…