Tuesday, September 1, 2020

GPT-3 meets “Kubla Khan” and the results are interesting, but not encouraging for AI poetry

I experimented a bit with GPT-3 and poetry in conjunction with my interview with Hollis Robbins. As you may recall, she had written a book about the African-American sonnet tradition. I suggested that we re-enact the contest between John Henry and the steam drill as a contest between a real poet, she chose Marcus Christian, and GPT-3. GPT-3 produced one sonnet, sorta’, which we covered in our interview on July 20 [1], and a second, which I commented on in a post of July 21, 2020, “GPT-3 writes two sonnets, sorta’. The first is better than the second” [digital humanities].

Quite independently of this, Gwern Branwen had been experimenting with poetry using both GPT-2 and GPT-3. He offered this general remark about prompting GPT-3:
For the prompt, just using the title/author is slightly unsatisfactory, as GPT-3 has memorized many of these famous poems and will, if you do not specify otherwise, happily complete them (which is an entirely valid thing to do, and does not represent ‘overfitting’—even when a completion does start with the next real lines, it will typically diverge in an entirely acceptable way after a few real lines). After playing around for a while, I thought I’d embrace the “Transformer Poetry” theme and make that the prompt:

Transformer AI poetry: Poetry classics as reimagined and rewritten by an artificial intelligence.
Using Coleridge’s “Kubla Khan” as a prompt for GPT-3

One of the poems he chose to work with is “Kubla Khan” (scroll down his page):
Transformer AI poetry: Poetry classics as reimagined and rewritten by an artificial intelligence.
Kubla Khan
Samuel Taylor Coleridge
In Xanadu did Kubla Khan
A stately pleasure-dome decree:
Where Alph, the sacred river, ran
Through caverns measureless to man
Down to a sunless sea.
I
saw the sacred river Alph
Rolling darkly, angrily,
Its roaring waters hurled
Down to the sunless sea.
Then I saw the dwelling-place
Of Coleridge in Xanadu:
A hundred caverns lofty, vast,
Down to a sunless sea.
I saw the caverns where he wrote,
And where he played the immortal strains
And poured out all his soul,
In lavish abundance:
This cavern was his Dawl, his Malta,
And where his enchanted spirit
Broke from its bounds in triumph,
Filled with the Glory he foresaw,
Piping and singing of immortal things
And ever-rising to immortal heights,
A hundred fathom deep.
That vision had the Prophecy
Which filled the world with song:
The glory of Kubla Khan
And Coleridge’s Song of Xanadu.
Notice the single capital letter “I” on the line after “sunless sea”. I asked Gwern about that and he responded that it was necessary to keep GPT-3 from simply completing the poem as Coleridge had written it, since GPT-3 had certainly “memorized” the original, which would be the obvious completion for the prompt.
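Gwern’s prompt, as quoted above, amounts to a simple string assembly: the theme line, the title and author, the opening five lines, and the dangling “I” that nudges the model away from the memorized continuation. A minimal sketch of that assembly (the function and variable names here are my own, purely illustrative; this is not Gwern’s code):

```python
# Illustrative reconstruction of the prompt's structure (names are my own).
THEME = ("Transformer AI poetry: Poetry classics as reimagined "
         "and rewritten by an artificial intelligence.")

OPENING = """In Xanadu did Kubla Khan
A stately pleasure-dome decree:
Where Alph, the sacred river, ran
Through caverns measureless to man
Down to a sunless sea."""

def build_prompt(title, author, opening, nudge="I"):
    """Theme line, title and author, the poem's opening lines, then a
    dangling word that steers the model off the memorized original."""
    return "\n".join([THEME, title, author, opening, nudge])

prompt = build_prompt("Kubla Khan", "Samuel Taylor Coleridge", OPENING)
print(prompt)
```

The point of the structure is the last line: ending the prompt mid-sentence, on a word that does not begin Coleridge’s next line, makes verbatim completion a poor next-word prediction.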

General Comments on GPT-3’s performance

GPT-3’s continuation falls roughly into two components, the first continues through “In lavish abundance” and the second picks up with the next line and continues on to the end. The first component more or less works within the ‘territory’ indicated in the prompt, which includes the first five lines of the original poem. It emphasizes that territory. The “sunless sea” line is repeated twice, Alph shows up again, and we have the waters, the cavern, Xanadu, and Coleridge himself. The poem then begins shifting toward the second component with “I saw the caverns...immortal strains...all his soul...”

The second component shifts toward the poet himself, Coleridge, and his poetizing. We have his “enchanted spirit” breaking free; the spirit foresees, pipes, sings, and rises to immortal heights, albeit “A hundred fathom deep”, and so forth, on to song, glory, and Xanadu. For what it’s worth, “Dawl” gave me pause and I had to do a bit of digging before Google Translate told me that it is Maltese for light.

The whole thing is rather rough, but that loose two-part structure is interesting. It gives the whole thing a crude coherence. On the other hand, the repetition of lines from the prompt and the inclusion of Coleridge’s name are annoying. They follow, of course, from the nature of this exercise and from the fact that GPT-3 composes by, in effect, ‘predicting’ the next word, and the next, and so on. Finally, I note that while the versification of “Kubla Khan” is intricate, with varying line lengths, alliteration at various points, and a complex rhyme scheme, there’s not much to be said for GPT-3’s versification.
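That ‘predict the next word’ loop can be sketched in miniature. Here a toy bigram model, trained on nothing but the prompt’s five opening lines, stands in for GPT-3’s vast network; the generation loop, though, has the same shape: sample a next word from a conditional distribution, append it, repeat.

```python
import random
from collections import defaultdict

# Toy stand-in for GPT-3: a bigram model over the prompt's opening lines.
text = ("in xanadu did kubla khan a stately pleasure-dome decree "
        "where alph the sacred river ran through caverns measureless "
        "to man down to a sunless sea").split()

# Record which word was observed following which.
follows = defaultdict(list)
for w, nxt in zip(text, text[1:]):
    follows[w].append(nxt)

def generate(seed, n_words, rng):
    """Autoregressive loop: predict, append, repeat -- the same shape of
    computation GPT-3 performs, at a vastly smaller scale."""
    out = [seed]
    for _ in range(n_words):
        choices = follows.get(out[-1])
        if not choices:  # dead end: no observed continuation
            break
        out.append(rng.choice(choices))
    return " ".join(out)

print(generate("down", 6, random.Random(0)))
```

With so little training text the model can only loop through phrases it has seen, which is, in caricature, exactly the repetition-of-the-prompt behavior noted above.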

More specific comments

“Kubla Khan” presents a peculiar challenge to an AI engine that is trained, as GPT-3 is, to guess what comes next. When faced with this kind of task, to create a poem by continuing on from the initial lines of a human-produced poem, one wants, I presume, something more or less like the original, but different. The world contains zillions of sonnets, for example, but only one “Kubla Khan”. Coleridge never wrote another poem like it – I’ve read them all, though some years ago – nor, as far as I know, has anyone else. So GPT-3 has no other models to go by.

I note further that the poem does in fact have a very elaborate formal structure, which I have described in some detail [2], so one can imagine another poet writing a poem in the manner of Coleridge’s original, call it a Kubla, though they’d have to decide just which aspects of that structure are important and which are not. For example, “Kubla Khan” is 54 lines long and in two parts. The first is 36 lines long and the second is 18 lines. Do we require the same of a new Kubla? Or is it sufficient that a Kubla have a first section that is twice as long as the second, say 30 and 15, or 20 and 10, or for that matter, 44 and 22, and so forth? Once we change the length, however, we’re going to have to alter the rhyme scheme. The poet makes such decisions, writes a poem, and we can judge the results.

GPT-3 obviously did nothing of the kind.

One aspect of the poem’s formal structure is that the two parts of the poem are quite different in character. We can see that, for example, in the deployment of pronouns, which is why I asked Gwern about his inclusion of “I” in the prompt. There are only four pronouns in the first 36 lines of the poem, none of them “I”. The second part has 16 pronouns, with “I” being used three times. Could the ‘premature’ appearance of ‘I’ have thrown GPT-3 off its stride?
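A tally of that kind is easy to mechanize. A crude sketch (the pronoun list and helper are my own and deliberately incomplete; the 4-versus-16 figures above come from the full text, not from this excerpt):

```python
import re
from collections import Counter

# A crude pronoun tally (this pronoun list is illustrative, not exhaustive).
PRONOUNS = {"i", "me", "my", "he", "him", "his", "she", "her",
            "it", "its", "they", "them", "their", "we", "us", "our"}

def pronoun_counts(text):
    """Count pronoun occurrences in a passage, case-insensitively."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w in PRONOUNS)

# The opening lines of the poem's second part:
part_two_opening = """A damsel with a dulcimer
In a vision once I saw:
It was an Abyssinian maid,
And on her dulcimer she played,
Singing of Mount Abora."""

print(pronoun_counts(part_two_opening))
```

Even in these five lines, four pronouns appear, including the first-person “I”, where the whole of part one yields only four.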

There’s more to the difference between the two parts of “Kubla Khan” than the use of pronouns. The first part of the poem is set in Xanadu, presumably, and is rich in imagery about physical place: the gardens, chasm, fountain, rocks, the river, the reflection (“shadow” in the poem) of the dome. But people aren’t so much present. Yes, Kubla decrees, the woman moans, and ancestral voices prophesy, but they all settle nicely into the larger landscape.

The second part of the poem is quite different. It starts off, not with a place, but a person, “a damsel with a dulcimer”, and it continues on with her and her song and the vision containing her. It isn’t until we are nine lines into the second part that we have an explicit connection with the first part, in line 46: “I would build that dome in air”. That is to say, until we reach that point, it’s as though we had jumped from line 36, “a sunny pleasure-dome with caves of ice”, into an entirely different poem. How is an AI engine built on word-to-word continuity to deal with that kind of extreme discontinuity within a single poetic text?

I could say a great deal more about the structure of “Kubla Khan” (see [2]), but this is enough to give a sense of the problem GPT-3 faced. “Kubla Khan” has an elaborate structure, both in its semantics (which I’ve only hinted at) and its versification (about which I’ve said almost nothing), and manages to bridge a yawning discontinuity in its middle. A guess-the-next-word engine like GPT-3 didn’t have a chance.

What are we to make of GPT-3’s performance in this case?

I don’t quite know. It is easy to say, well, GPT-3 isn’t much of a poet. I don’t find that at all satisfying. Just as I thought GPT-3’s first continuation of Marcus Christian’s sonnet was interesting, without for a minute believing it was a good poem [1], I feel the same about GPT-3’s continuation of “Kubla Khan”. In both cases I want to know how GPT-3 did it. Because GPT-3 is NOT a human being and did NOT ‘learn’ language in a way remotely resembling the human process. Yet in some cases it produces a remarkable simulacrum of human behavior and in other cases, like this one, the simulacrum leaves much to be desired.

The interesting and telling comparison, it seems to me, is between GPT-3 on natural language and various AI chess engines. As you may know, chess was one of the original problems taken up by artificial intelligence [3]. Language was subjected to computational investigation at the same time, but by a different group of thinkers, thinkers interested in translating from one natural language to another [4]. In 1997 IBM’s Deep Blue won a six-game match against Garry Kasparov. Ever since then computers have been better at chess than even the best humans.

The language skills of computers, however, do not match those of even mediocre humans. It is true that machine translation programs can provide useful translations for pedestrian purposes. If you are familiar with a subject area and want to get a sense of what some document in, say, Mandarin, says in that area, by all means, see what Google Translate does with it. But if you are working with a legal document, there is no computer program that is going to give you a legal-quality translation. You need a human translator.

It is that disparity, between computer performance on chess and computer performance with natural language, that interests me. GPT-3 does well on a number of limited natural language tasks. But it cannot write poetry. Why not?

Why is poetry so much more difficult for computers than chess while, for humans, the two seem comparably difficult? What can GPT-3 tell us about that difference? I don’t know, but whatever it is, it is locked up in the language model GPT-3 has created. That model, like the human language faculty, is opaque to us.

So far.

References

[1] William Benzon, An Electric Conversation with Hollis Robbins on the Black Sonnet Tradition, Progress, and AI, with Guest Appearances by Marcus Christian and GPT-3, Working Paper, July 2020, 12 pp., https://www.academia.edu/43668403/An_Electric_Conversation_with_Hollis_Robbins_on_the_Black_Sonnet_Tradition_Progress_and_AI_with_Guest_Appearances_by_Marcus_Christian_and_GPT_3.

[2] William Benzon, “Kubla Khan” and the Embodied Mind, PsyArt: A Hyperlink Journal for the Psychological Study of the Arts, Article 030915, November 29, 2003, https://www.academia.edu/8810242/_Kubla_Khan_and_the_Embodied_Mind.

[3] Wikipedia, Computer Chess, https://en.wikipedia.org/wiki/Computer_chess#History.

[4] Wikipedia, Machine Translation, https://en.wikipedia.org/wiki/Machine_translation.
