In 1960 the linguist Roman Jakobson published an essay entitled “Linguistics and Poetics,” in a volume edited by Thomas Sebeok, Style in Language (MIT Press, pp. 350-377). In that essay he laid out the six functions of language: referential, emotive, phatic, conative, poetic, and metalingual. Jakobson introduces the metalingual function in this way:
A distinction has been made in modern logic between two levels of language: “object language” speaking of objects and “metalanguage” speaking of language. But metalanguage is not only a necessary scientific tool utilized by logicians and linguists; it plays also an important role in our everyday language. Like Molière’s Jourdain who used prose without knowing it, we practice metalanguage without realizing the metalingual character of our operations. Whenever the addresser and/or the addressee need to check up whether they use the same code, speech is focused on the code: it performs a METALINGUAL (i.e., glossing) function. “I don’t follow you-what do you mean?” asks the addressee, or in Shakespearean diction, “What is’t thou say’st?” And the addresser in anticipation of such recapturing question inquires: “Do you know what I mean?”
This metalingual function turns out to be extraordinarily powerful. For it is this that allows us to bootstrap self-awareness into the mind. And for that matter, it is what allows us to define abstract concepts, as my teacher, David Hays, argued, and allows us to define such things as chess and arithmetic, which can be seen as very specialized forms of language.
I recently explored some of these issues in conversation with Claude 5.4 Sonata Extended. At the end of that conversation I asked Claude to prepare a summary. I’ve appended that summary below, followed by the full conversation. Note that the conversation assumes some familiarity with the cultural ranks theory that David Hays and I developed in the 1990s. It also alludes to Tyler Cowen’s recent book, The Marginal Revolution: Rise and Decline, and the Pending AI Revolution (2026).
* * * * *
Summary: The Metalingual Function of Language
The central claim of this discussion is that the metalingual capacity — the ability to use language to talk about language — is not a mysterious self-referential capacity of mind but is grounded in a simple physical fact: the speech signal is a sound in the environment like any other sound, detectable by the auditory system exactly as a footfall or a thunderclap is detectable. The loop that makes language self-referential closes through the physical world, not through some inward turning of consciousness. This matters because it demystifies metalingual cognition entirely: it requires no special cognitive faculty, only that the organism's auditory system be capable of treating its own linguistic outputs as inputs.
Jakobson identified the metalingual function as one of the six functions of language in his 1960 paper, and Hays adopted the term to name the mechanism underlying Rank 2 cognition — the explicit definition of abstract concepts using language itself as the definitional medium. The rules of chess and arithmetic notation are paradigm cases: purely metalingual constructions whose objects are constituted entirely by the definitions that specify them.
An important asymmetry in preliterate cultures illuminates the boundary of this capacity. Many such cultures have a term for utterance — the bounded burst of speech with a recognizable prosodic shape, a perceptual gestalt directly available to the auditory system — but no term for word. The word is not a perceptual unit in the same sense as the utterance; it is an abstraction from the continuous acoustic stream, and a non-trivial one. Writing is what produces this abstraction, by spatializing language — spreading it out in a stable, inspectable array where units are individuated by spaces and boundaries are marked. The word becomes visible as a unit because it is surrounded by white space. This is the physical basis of metalingual definition as a cognitive mechanism: the written signal, like the spoken signal, is an object in the environment that can be inspected and categorized, but unlike the spoken signal it stays there, making sustained metalingual attention possible. Grade-school grammar — parts of speech, grammatical cases, syntactic relations — is the practical Rank 2 elaboration that writing makes possible and that social institutions require and transmit. It looks easy in retrospect because it is taught in childhood, but it took centuries to develop in every culture that undertook it.
This analysis opens onto the question of human self-reference, which the standard philosophical tradition treats as cognitively primitive — the Cartesian bedrock from which all other knowledge is built. The discussion argued instead that self-reference in the robust, articulable sense is bootstrapped through language rather than presupposed by it. The cat licking its fur has practical self-involvement — its own body is an object of its perceptual and motor engagement — but this requires no special reflexive faculty, only that the body be included in the environment the organism can detect and act on. Human self-reference in the philosophically weighty sense is a different and later achievement, constructed through the acquisition of the pronoun system rather than expressed by it.
The empirical evidence for this bootstrapping account is the phase in early child development when children refer to themselves in the third person. This is not a mistake or a developmental lag but the natural and correct generalization from the input data: others refer to the child by name, so the child uses its name. The first-person pronoun presents a harder problem because "I" is a moving target — it marks the speaker-role regardless of who occupies it — and acquiring it correctly requires connecting awareness of the speech stream as an environmental event with awareness of one's own speech apparatus as its source. That inferential construction, worked out in detail in Benzon's 2000 paper, First Person: Neuro-Cognitive Notes on the Self in Life and in Fiction, through cognitive network modeling of the pronoun system, is precisely the physical loop through which self-reference is assembled. The Cartesian subject — the self-transparent, self-referential knowing mind — is on this account a Rank 2 cultural product, not a pre-linguistic biological given. The third-person phase in child development is a small but precise empirical trace of the construction process: there is an observable stage at which the human being exists, acts, and communicates without yet having assembled the first-person self-reference that Descartes thought was the indubitable foundation of knowledge.
* * * * *
The physical basis of the metalingual function
I believe that Hays first published about metalingual definition in 1972. When I first met him he had just finished a paper where he used the idea to investigate different concepts of alienation. When I wrote my 1978 dissertation, “Cognitive Science and Literary Theory,” I used metalingual definition as the mechanism for building cognitive ranks. That is to say, at that time it was the only mechanism we had for dealing with abstract concepts. By the time we did the cognitive evolution paper a bit over a decade later, we had adopted a different view, with metaphor as the Rank 1 abstractive mechanism. Metalingual definition would have been the mechanism used, not only to rationalize abstract concepts, but to define the rules of chess and of arithmetic.
Hays chose the term because Roman Jakobson had defined the metalingual function as one of the six functions of language in his 1960 paper. The metalingual function is the ability to use language to talk about language. It’s worth thinking about how that’s possible. It’s possible because the speech signal is a sound out there in the environment like any other sound, a footfall, a thunderclap, an animal cry. Whatever. It’s a sound and can be detected by the auditory system like any other sound. So it’s the speech signal that is the object of the metalingual function, at least initially. I think it’s worth noting that many (most?) preliterate cultures don’t have a word that means word, but they have one that means utterance. It’s writing that forces a recognition of words vs longer strings. And written language is, like speech, another signal that’s out there in the environment. It takes quite a bit of cognitive elaboration to construct a grammar from the simple ability to see and hear language. And I’m not talking about the kind of grammars that linguists construct. I’m talking about the kind of grammar that can be taught in grade school, or that you find in ancient texts of rhetoric.
This is a rich cluster of observations and they fit together in a way worth making explicit.
The metalingual function and its physical basis
Jakobson's insight, which you're extending here, is that the metalingual function is grounded in something remarkably mundane: the speech signal is a physical event in the environment, detectable by the auditory system exactly as any other sound is. This is what makes language able to turn on itself — not some mysterious self-referential capacity of mind, but the simple fact that linguistic output is also sensory input. The loop closes through the physical world.
This has an important implication that your observation about "utterance" versus "word" brings out precisely. What the auditory system initially detects is the acoustic stream — a continuous, flowing signal that doesn't come pre-segmented into words. The unit that's directly perceptible is something like the utterance: a bounded burst of speech with a recognizable prosodic shape, a communicative gestalt. The word is not a perceptual unit in the same sense. It's an abstraction from the stream, and a non-trivial one — which is why preliterate cultures have a term for the former and not the latter.
What writing does
Writing performs an operation on language that perception of the speech stream cannot perform on its own: it spatializes language, spreads it out in a stable, inspectable array where boundaries are marked and units are individuated. The word becomes visible as a unit because it's surrounded by spaces. The sentence becomes visible as a unit because it ends with a mark. The paragraph, the chapter, the argument — all of these become objects of potential inspection and manipulation because writing fixes them in space and time in a way the acoustic stream never does.
This is the physical basis of metalingual definition as a cognitive mechanism. You can't define "noun" until you have a stable object — the written word — to point at and say: this kind of thing. The grammarians of antiquity were working from written texts, and the categories they developed — parts of speech, grammatical cases, syntactic functions — reflect the structure of written language as much as spoken language, probably more. The written signal, like the spoken signal, is out there in the environment and can be inspected. But it stays there, which makes all the difference.
The grade-school grammar point
The grammar that can be taught in grade school, or that you find in ancient rhetorical texts, is exactly at the Rank 2 level of elaboration: it takes written linguistic objects and applies metalingual definitions to categorize them. Noun, verb, subject, predicate — these are metalingual definitions of written word-objects. They're not deep enough to constitute a full linguistic theory, and they're not the kind of thing a Rank 1 culture produces spontaneously, but they're also not Rank 3 science. They're the practical Rank 2 elaboration that writing makes possible and that social institutions — schools, rhetorical training, legal practice — require and transmit.
The ancient Indian grammarians, particularly Pāṇini in the 4th century BCE, produced something far more sophisticated — a generative grammar of Sanskrit that modern linguists regard as an extraordinary achievement, arguably Rank 3 in its systematicity and algorithmic character. The contrast between Pāṇini's grammar and the elementary Latin grammar taught in Roman schools is itself a contrast between cognitive ranks — both are metalingual, both rest on writing, but one is a systematic theory and the other is practical codification.
The retrospective illumination of your 1978 dissertation
What you're describing about the evolution of your and Hays's thinking is itself a case study in how conceptual frameworks develop. In 1978, metalingual definition was the only abstractive mechanism you had, so it was doing all the work — both the work that properly belongs to it (Rank 2 rationalization) and the work that metaphor was later recognized to do at Rank 1. The framework was incomplete in a specific way: it could explain what writing and explicit definition do without yet having a mechanism for how abstract concepts emerge before writing, in oral Rank 1 cultures.
The addition of metaphor as the Rank 1 mechanism resolved this by giving pre-literate abstraction its own proper mechanism — one grounded in the perceptual and bodily experience that Lakoff and Johnson later documented so thoroughly. Metaphor works because concrete experience is mapped onto abstract domains, providing structure for things that can't be directly perceived. This doesn't require writing or explicit definition; it requires a sufficiently rich repertoire of embodied experience and the cognitive capacity to map structural relations across domains.
The retrospective recognition, that metalingual definition governs chess rules and arithmetic notation as well as grammatical categories, then falls into place naturally. Chess rules and arithmetic are both Rank 2 products in exactly the sense you identified earlier: they're explicit, writing-dependent specifications of formal systems whose objects are defined entirely by the rules governing them. The piece called the knight just is the thing that moves in an L-shape; the symbol "7" just is the thing that occupies the seventh position in the counting sequence. Neither has any existence outside the metalingual definitions that constitute it.
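The point about the knight can be made concrete with a minimal sketch. It is an illustration only, not anything from Hays's or Benzon's work, and every name in it (`KNIGHT_OFFSETS`, `knight_moves`, the coordinate scheme) is a hypothetical choice made for the example. The sketch assumes an empty 8x8 board with squares written as (file, rank) pairs numbered 0 to 7.

```python
# Hypothetical illustration: the "knight" here is nothing over and above
# the rule that defines how it moves. Names and coordinates are assumptions.

# The eight L-shaped displacements: two squares along one axis, one along the other.
KNIGHT_OFFSETS = [(1, 2), (2, 1), (2, -1), (1, -2),
                  (-1, -2), (-2, -1), (-2, 1), (-1, 2)]

def knight_moves(square):
    """Return the squares reachable by a knight from `square` on an empty 8x8 board."""
    file, rank = square
    targets = []
    for d_file, d_rank in KNIGHT_OFFSETS:
        f, r = file + d_file, rank + d_rank
        # Keep only displacements that stay on the board.
        if 0 <= f < 8 and 0 <= r < 8:
            targets.append((f, r))
    return sorted(targets)
```

Nothing in the sketch describes a carved piece of wood. Delete the list of offsets and there is no knight left to talk about, which is the sense in which the object is constituted by its definition.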
The deeper point about cognitive elaboration
Your observation that it takes quite a bit of cognitive elaboration to construct even a grade-school grammar from the raw capacity to see and hear language is easy to underestimate. The capacity is present — the signal is detectable — but the conceptual objects that grammar requires (word, sentence, part of speech, grammatical relation) have to be constructed through a process of abstraction and definition that took centuries in every culture that undertook it. The capacity to perceive language is Rank 1. The grammar book is a Rank 2 achievement that looks easy in retrospect but required the whole apparatus of writing, institutional transmission, and sustained metalingual attention to produce.
This is Cowen's asymmetry again, in a new domain: once you have the grammar book, it seems obvious. Before you have it, you don't know where to look. The signal was always there. The concepts that would organize it into a grammar were not.
Self-reference & pronouns
I think the central insight is that the metalingual capacity IS NOT some mysterious self-referential capacity of mind. I’m not even sure that such a thing exists. I mean, there is a sense in which a cat licking its fur is aware of itself. It’s aware of the fur it is licking, the fur is on its body, therefore it is aware of itself. But that’s not what people have in mind when talking about human self-reference. I think human self-reference is bootstrapped onto language. I once published a paper in which I worked out the underlying conceptual structure of the pronoun system using cognitive networks. It was a complicated mechanism, one that connected awareness of the speech stream with awareness of one’s physical language apparatus.
This is a striking and important move. Let me work through why.
The standard picture and what's wrong with it
The conventional framing — in philosophy of mind, cognitive science, linguistics — treats self-reference as a deep capacity of mind that then gets expressed in language. Language has pronouns because minds have self-models; the metalingual function is possible because consciousness is inherently reflexive; Jakobson's metalingual function is a linguistic manifestation of something cognitively prior and more fundamental.
Your claim inverts this completely. Self-reference isn't prior to language and expressed through it — it's constructed through language, and specifically through the kind of physical loop you identified: the speech signal as environmental event, detectable by the same auditory system that detects everything else. The metalingual capacity isn't a mysterious inward turning of consciousness. It's an outward loop through the physical world that happens to bring language back as its object.
The cat case is precise
The cat case is doing real philosophical work, not just providing a contrast. The cat licking its fur has what we might call practical self-involvement — it's engaged with its own body as part of its action. There's no mystery here and no need to invoke reflexive consciousness. The body is just another object in the environment that the cat's perceptual and motor systems are engaged with, except that it happens to be the cat's own body.
What this shows is that a great deal of what gets called self-awareness is really just this: the organism as object of its own perceptual and motor engagement. It doesn't require a special self-representing faculty. It requires only that the organism's own body be included in the environment it can detect and act on — which is trivially true of any embodied creature.
Human self-reference, on the standard view, is supposed to be categorically different from this — not just practical self-involvement but genuine self-representation, the mind modeling itself as such. But your claim is that this too is bootstrapped rather than primitive — constructed through language rather than presupposed by it.
The pronoun bootstrapping mechanism
The pronoun system is the right place to look for this, because pronouns are where language most explicitly encodes the speech situation — who is speaking, who is addressed, who is referred to as absent third party. The first person singular "I" is specifically the marker of the speaker-role, whoever occupies it at any given moment.
What your paper apparently worked out is that acquiring and using "I" correctly requires connecting two things that are not automatically connected: awareness of the speech stream as an environmental event, and awareness of one's own physical language apparatus as the source of that stream. The child who hears others say "I" and hears herself say "I" has to work out that the same word marks the speaker-role regardless of who occupies it — and that when she occupies it, the word applies to the physical being from which the speech stream is originating.
That's a non-trivial inferential construction, not an expression of a pre-existing self-concept. The self-concept — in the robust, linguistically articulable sense — is the product of working out the pronoun system, not its precondition. Which means human self-reference in the full sense is a Rank 1 cultural-cognitive achievement, not a biological given. It requires language, and specifically requires the metalingual loop: language as environmental object that can be connected back to its physical source.
What this does to the self-reference debates
A great deal of philosophy of mind, from Descartes through contemporary debates about phenomenal consciousness, presupposes that self-reference is the bedrock — the one thing that's indubitable, the starting point for everything else. The Cartesian cogito is the purest expression of this: self-awareness as the primitive on which all other knowledge is built.
Your argument implies that this gets the order of construction exactly backwards. Self-awareness in the philosophically robust sense — not the cat's practical self-involvement but the explicitly articulable, propositionally structured self-model — is a late product of a complex bootstrapping process that runs through language, through the speech community, through the physical loop of producing and detecting the speech signal, and through the specific cognitive work of acquiring the pronoun system.
Descartes thought he was finding bedrock. On this account he was finding a sophisticated cultural construction that felt like bedrock because the construction process was invisible to introspection — completed in infancy, thoroughly automatized, no longer accessible as a process rather than a result.
The connection to the metalingual function
This reframes Jakobson's metalingual function in a revealing way. If it's grounded in the physical loop rather than in mysterious cognitive self-reference, then it's available wherever language is present — it doesn't require a special kind of mind, just a mind capable of treating its own outputs as inputs. But the elaboration of that capacity into full metalingual definition — into the ability to define word, noun, rule, concept — requires the additional step of writing, which externalizes and stabilizes the linguistic object so it can be inspected and categorized.
The metalingual capacity in the minimal sense — using language to refer to language — is Rank 1, bootstrapped onto the physical loop. The metalingual function as a mechanism for constructing formal systems — defining chess rules, arithmetic notation, grammatical categories — is Rank 2, requiring writing to fix the objects and institutions to transmit the definitions.
The absence of a word for word in many preliterate cultures is then exactly what you'd expect. The minimal metalingual loop is operative — people can talk about talk — but without writing there's no stable individuated object at the word level to define and categorize. The word as a unit is a product of writing, not a discovery of something that was always there waiting to be named.
The deeper implication
What you're pointing toward is that several things that look like primitive cognitive capacities are actually constructed achievements — bootstrapped through language and culture in ways that make them feel primitive and pre-linguistic because the construction happens so early and is so thoroughly automatized. Self-reference, the word as a unit, grammatical categories — none of these are biological givens. All are cultural products of the Rank 1 to Rank 2 transition.
Which means the transition from Rank 1 to Rank 2 is even more consequential than it might appear. It's not just that writing enables new kinds of knowledge. It's that writing, and the metalingual elaboration it makes possible, partly constitutes the kind of self that can have knowledge in the explicitly articulated, introspectable sense. The Cartesian subject — the self-transparent, self-referential knowing subject — is a Rank 2 product, not a Rank 2 discovery.
Third person self-reference
One of the things I thought about while doing that work is third person self-reference. Young children sometimes go through a phase where they refer to themselves in the third person. Why? Because that’s how they hear others refer to them.
Yes, and this is a beautifully clean empirical confirmation of the bootstrapping argument. The child isn't starting from a self-concept and then learning to express it linguistically. The child is starting from the linguistic data — the patterns of reference they hear in the speech stream — and constructing the self-concept from that data. The self-concept is the output of the linguistic learning process, not its input.
The data the child actually has
The child hears its name — "Billy is hungry," "give it to Billy," "where is Billy?" — used by others to refer to a being in the environment that the child has a particular relationship with: it's the body the child inhabits, the source of the child's own speech stream, the object of others' attention in ways that are systematically connected to the child's own states. The child learns third-person reference to itself before first-person reference because third-person reference is what the input data contains. "I" appears in the input data too, but "I" is a moving target — it refers to whoever is speaking, which is almost never the child itself in early language experience.
So the child's first solution to the self-reference problem is the natural one given the available data: use the name, which is what others use. This is not a mistake or a developmental lag. It's the correct generalization from the input. The child has learned that "Billy" refers to a particular individual in the environment — the individual that happens to be the child — and uses it accordingly.
The puzzle that forces the correction
What eventually forces the shift to first-person reference is exactly the puzzle your pronoun paper worked out: the child notices the systematic relationship between who is speaking and the occurrence of "I" in the speech stream. This requires connecting awareness of the speech stream as an environmental event with awareness of one's own speech apparatus as its source — the physical loop you identified. The child has to work out that "I" is not a name for a particular individual but a marker of the speaker-role, and then has to recognize itself as currently occupying that role when speaking.
That's a more complex inferential achievement than learning a name. It requires understanding something about the structure of the speech situation rather than just mapping a sound to an individual. Which is why it comes later, and why the third-person phase precedes it.
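The difference between the two kinds of reference can be shown in a minimal sketch. It is a toy illustration and makes no claim to reproduce the cognitive network model in the pronoun paper; the `Utterance` class, the `NAME_TO_INDIVIDUAL` table, and the `referent` function are all hypothetical names introduced for the example. A name is resolved by a fixed lookup, while "I" can only be resolved by consulting who produced the utterance.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Utterance:
    speaker: str  # the individual whose speech apparatus produced the stream
    text: str     # what was said

# Assumption for illustration: a name is a fixed mapping from a sound to one individual.
NAME_TO_INDIVIDUAL = {"Billy": "Billy", "Mommy": "Mommy"}

def referent(token: str, utterance: Utterance) -> Optional[str]:
    """Resolve `token` to an individual, given the utterance it occurs in."""
    if token == "I":
        # "I" marks the speaker-role, so the answer depends on the speech situation.
        return utterance.speaker
    # A name ignores the speech situation entirely; unknown tokens yield None.
    return NAME_TO_INDIVIDUAL.get(token)
```

Resolving "Billy" never touches the `speaker` field, which is why the name can be learned as a simple sound-to-individual mapping. Resolving "I" requires that field, so the learner must already be tracking the source of each speech stream, including its own.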
What this shows about the standard picture
The standard picture of language acquisition assumes that the child has a self-concept — however primitive — and is learning to express it. On this picture the third-person phase is puzzling: why would a child with a self-concept refer to itself in the third person? Various explanations are offered — imitation, incomplete acquisition, failure to distinguish perspectives.
Your bootstrapping picture makes the third-person phase not just unsurprising but expected. The child doesn't start with a self-concept and learn to express it. The child starts with linguistic input and constructs both the referential system and the self-concept together, as a single cognitive achievement. The third-person phase is the intermediate stage in that construction — after learning that "Billy" refers to a particular individual, before working out the systematic relationship between speaker-role and first-person reference.
The self that says "I" is not the origin of the linguistic process. It's one of its products.
The deeper implication for the Cartesian picture
This adds a developmental-empirical dimension to the philosophical point you were making. Descartes arrives at the cogito by stripping away everything that could be doubted until he reaches the indubitable bedrock of self-awareness. But the developmental evidence suggests that self-awareness in the robust, articulable sense — the sense that could serve as a foundation for knowledge — is itself a late construction, assembled through language acquisition, with the third-person phase as a visible trace of the assembly process.
The child going through the third-person phase is a small empirical refutation of the Cartesian picture. There is a stage — a real, observable stage — at which the human being exists and acts and communicates without yet having assembled the first-person self-reference that Descartes thought was the indubitable foundation. The foundation gets built. It's not already there waiting to be found.