This is an important point.
While all the common *sampling* strategies only choose 1 token at a time, attention-layer training does *not* propagate gradients *backward* 1 token at a time, meaning that some intermediate-layer features probably model aspects of much later tokens. https://t.co/b0nw0RLf3f

— davidad 🎇 (@davidad) October 19, 2023
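To make the point concrete, here's a minimal sketch in PyTorch (a toy model of my own devising, not anything from davidad's tweet): under teacher forcing, the cross-entropy loss computed at a *late* token position sends its gradient back, through causal attention, into the intermediate features of much *earlier* positions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab, d_model, seq_len = 50, 32, 8

# toy pieces: embedding, one causal self-attention layer, and an LM head
embed = nn.Embedding(vocab, d_model)
attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
lm_head = nn.Linear(d_model, vocab)

tokens = torch.randint(0, vocab, (1, seq_len))   # a made-up "document"
x = embed(tokens)
x.retain_grad()                                  # keep gradients on the intermediate features

# causal mask: position i may only attend to positions <= i (True = blocked)
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
h, _ = attn(x, x, x, attn_mask=mask)
logits = lm_head(h)                              # next-token logits at every position

# isolate only the loss for predicting the *final* token (position seq_len - 1),
# which is read off the logits at position seq_len - 2
loss_last = F.cross_entropy(logits[0, -2:-1], tokens[0, -1:])
loss_last.backward()

# the gradient of that late-token loss reaches the features of the *first* token
print(x.grad[0, 0].abs().sum())                  # nonzero
```

In real training the losses at all positions are summed, so every token's intermediate features receive gradient from every later position they can influence, which is why those features can end up modeling aspects of much later tokens, exactly as davidad says.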
And shame shame shame on the experts for allowing, encouraging, instructing so many to think that this is how they work.
This technology is too important to be left in the hands of these experts. They may be expert in programming the engines and "training" them, but that's as far as their expertise goes. They need to rethink their "understanding," if you can call it that, of how these engines function to produce text.
Assuming they have to "dumb down" the information for other people?
Alas, I fear it's more complicated than that. I think many of them half-way believe it themselves. I certainly got a lot of push-back when I posted on the subject over at LessWrong, where there are technical experts.
That's what I wondered.