I find it interesting that deep learning has done well with both images and texts, though they are very different kinds of objects that make ostensibly different conceptual demands. So I decided to construct a thought experiment.
Let’s treat a text as a string of colored beads. We assign each word a color, any color, as long as each word TYPE gets a distinct color. Then we take each TOKEN in a text and replace it with a pixel having the color assigned to its type. Now we’ve transformed a text into a string of pixels – color beads on a string. We do that for each text in a corpus and then model the corpus with the same methods used to build large language models.
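For concreteness, here is a minimal sketch of that encoding step in Python. Everything in it is illustrative scaffolding of my own: the toy corpus, the arbitrary palette, and helper names like word_to_color and encode_text are assumptions made just to show the mechanics, not part of any actual system.

```python
# A toy version of the encoding step. The corpus, the palette, and the
# function names here are all arbitrary illustrative choices.

from itertools import product

# Toy corpus: each text is a string of word tokens.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the cat",
]

# The vocabulary of word TYPES.
vocab = sorted({token for text in corpus for token in text.split()})

# Assign each word type a distinct RGB color. Which color goes with which
# word doesn't matter; all that matters is that the mapping is one-to-one.
palette = list(product(range(0, 256, 51), repeat=3))   # 216 distinct colors
word_to_color = {word: palette[i] for i, word in enumerate(vocab)}

def encode_text(text):
    """Replace each word TOKEN with the pixel (RGB triple) of its TYPE."""
    return [word_to_color[token] for token in text.split()]

# Every text in the corpus becomes a string of color beads.
pixel_corpus = [encode_text(text) for text in corpus]
print(pixel_corpus[0])
# e.g. [(0, 0, 255), (0, 0, 0), (0, 0, 204), ...]  -- one pixel per token
```

Prompting the resulting model, as described next, is then just encode_text applied to a natural language prompt before the pixels are handed over.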
To prompt the model we have to feed it a string of pixels. That’s easily done. We simply generate a natural language prompt, translate that prompt into pixel form and present those pixels to the model. The model will then extend the string in the normal way. We could even set up a pair of models so that one prompts the other with strings of pixels.
These pixel strings would be unintelligible to humans, nor would they be very interesting as images. They’d just be a linear jumble of color. But those color jumbles become intelligible as texts once the proper word tokens are substituted back for the color pixels.
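And that reverse substitution is just as mechanical. Continuing the sketch above (same hypothetical word_to_color mapping), the color jumble decodes back into readable text only because the type-to-color assignment is one-to-one:

```python
# Invert the hypothetical type-to-color mapping from the sketch above.
color_to_word = {color: word for word, color in word_to_color.items()}

def decode_pixels(pixels):
    """Substitute the proper word token back in for each color pixel."""
    return " ".join(color_to_word[pixel] for pixel in pixels)

# Round trip: text -> pixel string -> text.
pixels = encode_text("the dog sat on the mat")
assert decode_pixels(pixels) == "the dog sat on the mat"
```

The model itself never performs this decoding; it only ever traffics in the pixel strings.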
You might object that this thought experiment is completely artificial. Well, yeah, it is. And in a way that’s the point. Consider this passage from a recent article by Yann LeCun and Jacob Browning, “What AI Can Tell Us About Human Intelligence”:
This assumption is very controversial and part of an older debate. The neural network approach has traditionally held that we don’t need to hand-craft symbolic reasoning but can instead learn it: training a machine on examples of symbols engaging in the right kinds of reasoning will allow it to be learned as a matter of abstract pattern completion. In short, the machine can learn to manipulate symbols in the world, despite not having hand-crafted symbols and symbolic manipulation rules built in.
Contemporary large language models — such as GPT-3 and LaMDA — show the potential of this approach. They are capable of impressive abilities to manipulate symbols, displaying some level of common-sense reasoning, compositionality, multilingual competency, some logical and mathematical abilities, and even creepy capacities to mimic the dead. If you’re inclined to take symbolic reasoning as coming in degrees, this is incredibly exciting.
I think that second paragraph is misleading. In what sense are these LLMs manipulating symbols? They aren’t doing anything that isn’t done with those pixel strings in my (completely artificial) thought experiment. No one watching two large pixel-string models would think they were exchanging meaningful symbols. They’re just passing meaningless pixel strings back and forth.
The point is an old and obvious one: word meanings don’t exist in word forms, whether written or spoken. Word meanings exist in the minds of the people speaking and writing. LLMs simply do not have access to those meanings. What’s interesting, and what we must understand, is that they can create such a convincing simulacrum of meaning based entirely on contextual information about word forms, about signifiers.
And that’s what words and images have in common: context. But the contextual patterns differ in character. Images have a great deal of pixel-to-pixel continuity in two dimensions. Continuity among words emerges only when their contextual relationships are represented in high-dimensional spaces.