This strengthens my hypothesis that there is a large set of equivalent neural architectures of which the original transformer is just one sample. https://t.co/rW5eGrbSn0
— Richard Socher (@RichardSocher) March 14, 2025
We found a surprisingly simple alternative to normalization layers: the scaled tanh function (yes, we go back to the 80s). We call it Dynamic Tanh, or DyT. pic.twitter.com/0sZ44mbZHR
— Zhuang Liu (@liuzhuang1234) March 14, 2025
Therefore, we replace norm layers with the proposed Dynamic Tanh (DyT) layer, and it is really simple: pic.twitter.com/qWToAhEmWX
— Zhuang Liu (@liuzhuang1234) March 14, 2025
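The screenshot in the tweet shows the layer in a few lines of PyTorch. As a rough sketch of the idea as the paper describes it, DyT computes gamma * tanh(alpha * x) + beta, where alpha is a learnable scalar (initialized around 0.5) and gamma, beta are the usual per-channel affine parameters; the code below is a reading of that description, not the authors' exact implementation:

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Dynamic Tanh: element-wise tanh(alpha * x) with a learnable scalar alpha,
    followed by the usual per-channel affine (gamma, beta). Intended as a
    drop-in replacement for LayerNorm / RMSNorm in a transformer block."""

    def __init__(self, num_features: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), alpha_init))  # learnable scalar
        self.weight = nn.Parameter(torch.ones(num_features))     # gamma
        self.bias = nn.Parameter(torch.zeros(num_features))      # beta

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Squash activations with a scaled tanh, then apply the affine transform.
        return torch.tanh(self.alpha * x) * self.weight + self.bias
```

Unlike LayerNorm or RMSNorm, there is no reduction over the feature dimension here: the tanh is purely element-wise, which is what makes the layer so cheap.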
DyT is faster than RMSNorm (common in frontier LLMs) on H100s pic.twitter.com/i7zgVTYKLB
— Zhuang Liu (@liuzhuang1234) March 14, 2025
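The speed numbers in the linked image are the authors' H100 measurements. For a rough local sanity check, a microbenchmark along these lines, assuming the hypothetical DyT class from the sketch above, a CUDA GPU, and PyTorch 2.4+ for nn.RMSNorm, compares per-call latency of the two layers; it is not the paper's benchmark setup:

```python
import torch
import torch.nn as nn

dim, tokens = 4096, 8192
x = torch.randn(tokens, dim, device="cuda", dtype=torch.bfloat16)

rms = nn.RMSNorm(dim, device="cuda", dtype=torch.bfloat16)  # PyTorch >= 2.4
dyt = DyT(dim).to(device="cuda", dtype=torch.bfloat16)      # sketch class from above

@torch.inference_mode()
def time_layer(layer, iters=1000):
    # Warm up, then time with CUDA events so we measure GPU time, not Python overhead.
    for _ in range(10):
        layer(x)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        layer(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call

print(f"RMSNorm: {time_layer(rms):.4f} ms   DyT: {time_layer(dyt):.4f} ms")
```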
Normalization layers have always been one of the more mysterious aspects of deep learning for me, and this work has given me a better understanding of their role.
— Zhuang Liu (@liuzhuang1234) March 14, 2025
Given that model training and inference can require tens of millions of dollars in compute, DyT has the potential to contribute…