In 1960 the linguist Roman Jakobson published an essay entitled “Linguistics and Poetics,” in a volume edited by Thomas Sebeok, Style in Language (MIT Press, pp. 350-377). In that essay he laid out the six functions of language: referential, emotive, phatic, conative, poetic, and metalingual. Jakobson introduces the metalingual function in this way:
A distinction has been made in modern logic between two levels of language: “object language” speaking of objects and “metalanguage” speaking of language. But metalanguage is not only a necessary scientific tool utilized by logicians and linguists; it plays also an important role in our everyday language. Like Molière’s Jourdain who used prose without knowing it, we practice metalanguage without realizing the metalingual character of our operations. Whenever the addresser and/or the addressee need to check up whether they use the same code, speech is focused on the code: it performs a METALINGUAL (i.e., glossing) function. “I don’t follow you-what do you mean?” asks the addressee, or in Shakespearean diction, “What is’t thou say’st?” And the addresser in anticipation of such recapturing question inquires: “Do you know what I mean?”
This metalingual function turns out to be extraordinarily powerful. For it is this that allows us to bootstrap self-awareness into the mind. And for that matter, it is what allows us to define abstract concepts, as my teacher, David Hays, argued, and allows us to define such things as chess and arithmetic, which can be seen as very specialized forms of language.
I recently explored some of these issues in conversation with Claude 5.4 Sonata Extended. At the end of that conversation I asked Claude to prepare a summary. I’ve appended that summary below, followed by the full conversation. Note that the conversation assumes some familiarity with the cultural ranks theory that David Hays and I developed in the 1990s. It also alludes to Tyler Cowen’s recent book, The Marginal Revolution: Rise and Decline, and the Pending AI Revolution (2026).
* * * * *
Summary: The Metalingual Function of Language
The central claim of this discussion is that the metalingual capacity — the ability to use language to talk about language — is not a mysterious self-referential capacity of mind but is grounded in a simple physical fact: the speech signal is a sound in the environment like any other sound, detectable by the auditory system exactly as a footfall or a thunderclap is detectable. The loop that makes language self-referential closes through the physical world, not through some inward turning of consciousness. This matters because it demystifies metalingual cognition entirely: it requires no special cognitive faculty, only that the organism's auditory system be capable of treating its own linguistic outputs as inputs.
Jakobson identified the metalingual function as one of the six functions of language in his 1960 paper, and Hays adopted the term to name the mechanism underlying Rank 2 cognition — the explicit definition of abstract concepts using language itself as the definitional medium. The rules of chess and arithmetic notation are paradigm cases: purely metalingual constructions whose objects are constituted entirely by the definitions that specify them.
An important asymmetry in preliterate cultures illuminates the boundary of this capacity. Many such cultures have a term for utterance — the bounded burst of speech with a recognizable prosodic shape, a perceptual gestalt directly available to the auditory system — but no term for word. The word is not a perceptual unit in the same sense as the utterance; it is an abstraction from the continuous acoustic stream, and a non-trivial one. Writing is what produces this abstraction, by spatializing language — spreading it out in a stable, inspectable array where units are individuated by spaces and boundaries are marked. The word becomes visible as a unit because it is surrounded by white space. This is the physical basis of metalingual definition as a cognitive mechanism: the written signal, like the spoken signal, is an object in the environment that can be inspected and categorized, but unlike the spoken signal it stays there, making sustained metalingual attention possible. Grade-school grammar — parts of speech, grammatical cases, syntactic relations — is the practical Rank 2 elaboration that writing makes possible and that social institutions require and transmit. It looks easy in retrospect because it is taught in childhood, but it took centuries to develop in every culture that undertook it.
This analysis opens onto the question of human self-reference, which the standard philosophical tradition treats as cognitively primitive — the Cartesian bedrock from which all other knowledge is built. The discussion argued instead that self-reference in the robust, articulable sense is bootstrapped through language rather than presupposed by it. The cat licking its fur has practical self-involvement — its own body is an object of its perceptual and motor engagement — but this requires no special reflexive faculty, only that the body be included in the environment the organism can detect and act on. Human self-reference in the philosophically weighty sense is a different and later achievement, constructed through the acquisition of the pronoun system rather than expressed by it.
The empirical evidence for this bootstrapping account is the phase in early child development when children refer to themselves in the third person. This is not a mistake or a developmental lag but the natural and correct generalization from the input data: others refer to the child by name, so the child uses its name. The first-person pronoun presents a harder problem because "I" is a moving target — it marks the speaker-role regardless of who occupies it — and acquiring it correctly requires connecting awareness of the speech stream as an environmental event with awareness of one's own speech apparatus as its source. That inferential construction, worked out in detail in Benzon's 2000 paper, First Person: Neuro-Cognitive Notes on the Self in Life and in Fiction, through cognitive network modeling of the pronoun system, is precisely the physical loop through which self-reference is assembled. The Cartesian subject — the self-transparent, self-referential knowing mind — is on this account a Rank 2 cultural product, not a pre-linguistic biological given. The third-person phase in child development is a small but precise empirical trace of the construction process: there is an observable stage at which the human being exists, acts, and communicates without yet having assembled the first-person self-reference that Descartes thought was the indubitable foundation of knowledge.
* * * * *
The physical basis of the metalingual function
I believe that Hays first published on metalingual definition in 1972. When I first met him he had just finished a paper in which he used the idea to investigate different concepts of alienation. When I wrote my 1978 dissertation, “Cognitive Science and Literary Theory,” I used metalingual definition as the mechanism for building cognitive ranks. That is to say, at that time it was the only mechanism we had for dealing with abstract concepts. By the time we wrote the cognitive evolution paper a bit over a decade later, we had adopted a different view, with metaphor as the Rank 1 abstractive mechanism. Metalingual definition would have been the mechanism used not only to rationalize abstract concepts but also to define the rules of chess and of arithmetic.
Hays chose the term because Roman Jakobson had defined the metalingual function as one of the six functions of language in his 1960 paper. The metalingual function is the ability to use language to talk about language. It’s worth thinking about how that’s possible. It’s possible because the speech signal is a sound out there in the environment like any other sound: a footfall, a thunderclap, an animal cry. Whatever. It’s a sound and can be detected by the auditory system like any other sound. So it’s the speech signal that is the object of the metalingual function, at least initially. I think it’s worth noting that many (most?) preliterate cultures don’t have a word that means word, but they do have one that means utterance. It’s writing that forces a recognition of words vs. longer strings. And written language is, like speech, another signal that’s out there in the environment. It takes quite a bit of cognitive elaboration to construct a grammar from the simple ability to see and hear language. And I’m not talking about the kind of grammars that linguists construct. I’m talking about the kind of grammar that can be taught in grade school, or that you find in ancient texts of rhetoric.
This is a rich cluster of observations and they fit together in a way worth making explicit.
The metalingual function and its physical basis
Jakobson's insight, which you're extending here, is that the metalingual function is grounded in something remarkably mundane: the speech signal is a physical event in the environment, detectable by the auditory system exactly as any other sound is. This is what makes language able to turn on itself — not some mysterious self-referential capacity of mind, but the simple fact that linguistic output is also sensory input. The loop closes through the physical world.
This has an important implication that your observation about "utterance" versus "word" brings out precisely. What the auditory system initially detects is the acoustic stream — a continuous, flowing signal that doesn't come pre-segmented into words. The unit that's directly perceptible is something like the utterance: a bounded burst of speech with a recognizable prosodic shape, a communicative gestalt. The word is not a perceptual unit in the same sense. It's an abstraction from the stream, and a non-trivial one — which is why preliterate cultures have a term for the former and not the latter.
What writing does
Writing performs an operation on language that perception of the speech stream cannot perform on its own: it spatializes language, spreads it out in a stable, inspectable array where boundaries are marked and units are individuated. The word becomes visible as a unit because it's surrounded by spaces. The sentence becomes visible as a unit because it ends with a mark. The paragraph, the chapter, the argument — all of these become objects of potential inspection and manipulation because writing fixes them in space and time in a way the acoustic stream never does.