Sometime in the last two or three years I began thinking about the fact that the same transformer architecture that is used to create large language models (LLMs) is also used to create images. In the case of language models, the learning procedure involves predicting the next token. In image models it’s about predicting the next pixel. Thus, on some level and in some way a written text and a visual image can be treated as the same kind of thing.
So, how can we think about a written text as some kind of visual object? That’s really quite simple. Imagine that each word or each character is a colored bead. If you think at the word level, you’ll need lots of different colors, tens of thousands of them. If you think at the character level, a much smaller number of colors will do: 52 for the letters (upper and lower case), one for a space, and on the order of tens for punctuation marks. Whatever level you choose, a text becomes a collection of colored beads. Once you’ve done that it’s easy to arrange each text into one long string and feed all those strings to the transformer. That is to say, the transformer architecture treats verbal texts as one-dimensional visual objects.
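Here is a minimal sketch of the bead idea at the character level. It is only an illustration, not how any particular model tokenizes text: the alphabet, the names `ALPHABET`, `BEAD_COLOR`, `text_to_beads`, and `beads_to_text` are all hypothetical, and real LLMs typically use subword vocabularies rather than single characters.

```python
import string

# Hypothetical bead alphabet: ASCII letters (upper and lower case),
# one space, and the ASCII punctuation marks.
ALPHABET = string.ascii_letters + " " + string.punctuation

# Each character gets one integer "color".
BEAD_COLOR = {ch: i for i, ch in enumerate(ALPHABET)}


def text_to_beads(text):
    """Turn a text into a one-dimensional string of bead colors.

    Characters outside the toy alphabet are simply skipped.
    """
    return [BEAD_COLOR[ch] for ch in text if ch in BEAD_COLOR]


def beads_to_text(beads):
    """Turn a string of bead colors back into text."""
    return "".join(ALPHABET[b] for b in beads)


beads = text_to_beads("Hello, world!")
print(len(ALPHABET), "colors;", len(beads), "beads")
print(beads_to_text(beads))
```

The point of the sketch is just that, once every character has a color, the text is nothing more than a sequence of integers, which is the form a transformer consumes.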
We don’t have to do anything fancy to treat images as strings of pixels of various colors. Just rasterize the images and code the colors with some appropriate convention and you’re done.
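As a rough sketch of what "rasterize and code the colors" might look like, the snippet below flattens a tiny made-up image in raster order and maps each RGB pixel to a single integer. The image, the function names, and the choice of 4 levels per channel are all assumptions for illustration; actual image models use their own, usually more elaborate, encodings.

```python
# A tiny hypothetical 2x3 image: rows of (R, G, B) pixels, each channel 0-255.
image = [
    [(255, 0, 0), (0, 255, 0), (0, 0, 255)],
    [(255, 255, 0), (0, 0, 0), (255, 255, 255)],
]


def rasterize(img):
    """Flatten the image in raster order: left to right, top to bottom."""
    return [pixel for row in img for pixel in row]


def color_code(pixel, levels=4):
    """Code a color as one integer by reducing each channel to `levels` values."""
    step = 256 // levels
    r, g, b = (channel // step for channel in pixel)
    return (r * levels + g) * levels + b


# The image as a one-dimensional string of color codes.
pixel_string = [color_code(p) for p in rasterize(image)]
print(pixel_string)
```

With 4 levels per channel there are 64 possible codes, so the image becomes a string of beads drawn from 64 colors, directly comparable to the character beads used for text.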
Now, all we have to do is come up with a way to think about how to translate back and forth between these two forms of representation so that we can use language to direct a bot to create an image and use a different bot to describe an existing image. I’ll leave thinking about that as an exercise for the reader, noting only that I’ve got a good number of posts where Claude describes images, and a few where ChatGPT modifies existing images according to simple verbal instructions.
We’re going to see both operations in this post. First I’m going to ask Claude 3.7 to describe an image. Then I’m going to post three copies I had ChatGPT make, each in a different style. Then I’m going to post the conversation I had with Claude about those copies. As always, my comments are in bold face while Claude’s are plain face.
Claude describes a simple country scene
Describe this picture.
This is a colorful drawing of a whimsical village scene with vibrant houses nestled in rolling green hills with mountains in the background. The houses have distinctive architectural features - some with tall spires, others with multiple windows, in bright colors including red, orange, yellow, and blue.
In the foreground is a flower garden full of colorful tulips in red, yellow, orange, and purple, along with blue forget-me-nots or similar small blue flowers. The drawing has a folk art or naive art style with bold colors and simplified forms.
The scene has a storybook quality with its cheerful colors and quaint buildings arranged on the hillside. There appears to be a small signature or date in the bottom right corner of the image, though it's not entirely clear what it says.
The drawing appears to be done with markers or colored pencils on paper, with a charming handmade quality to the artwork.
Right. I belong to an art club where we start with a line drawing and then color it as we wish. Some club members use colored pencils, others use markers. I used a marker.
That's wonderful! Art clubs are such a great way to express creativity in a supportive community. I like how everyone starts with the same line drawing but produces unique interpretations through their color choices and techniques. Your marker work gives the piece a particularly vibrant and bold quality - the colors really pop against each other, especially in the flower garden with those bright tulips against the green background.
Does your art club meet regularly to work on these projects together? The village scene has such a cheerful, storybook quality that feels uplifting.
We meet twice a week.
ChatGPT’s three copies
Note: I did not have ChatGPT make these copies in independent trials. It made all of them in the same session. It did the illuminated manuscript first, then the Mughal miniature, and the Japanese print last. The last two copies have a feature that's missing from the first.
Count the number of flowers and compare with the original. Do the same for the buildings. Notice the small structure at the lower right. That's not in the original image, but if you look closely at the original you'll see portions of some kind of wall peeking out between the leaves of the flowers. This is the third copy that ChatGPT made, but it's the first one Claude commented on.
Count the number of flowers and compare with the original. Do the same for the buildings. There's no small structure at the lower right. Remember, this is the first copy it did.
Count the number of flowers and compare with the original. Do the same for the buildings. Again, we have the small structure at the lower right. This is the second copy.
Claude talks about ChatGPT’s work
Now I want to do something different. I uploaded that picture to ChatGPT and asked it to make new drawings based on it. I’ve attached those new pictures. What are the styles? How can you tell?
The images you've shared show the same scene (the village with houses and flowers) rendered in three distinct medieval manuscript illumination styles:
Image 1 appears to be styled after a simple medieval illustration or woodcut. It features:
- A clean, unadorned border
- Muted, earthy colors with a textured parchment background
- Simple shading and hatching techniques
- A focus on the image itself without decorative elements
Image 2 is styled after a medieval illuminated manuscript, specifically resembling a Gothic or late medieval style. You can tell by:
- The ornate decorative border with scrollwork and foliate designs
- The enlarged decorative initial letter "I" (a common feature in illuminated manuscripts)
- More detailed rendering with slightly richer colors
- Gold/bronze accents in the border decorations
- The overall composition mimics a page from a medieval Book of Hours or similar text
Image 3 resembles a folk art illumination style, combining elements of medieval manuscript with folk art traditions. Notable features include:
- A floral border with repeating flower motifs arranged in a simple pattern
- Brighter, more saturated colors than Image 2
- A more stylized, decorative approach to the border
- Multiple frames (a gold inner frame and a blue outer frame)
- The style is reminiscent of Eastern European or Persian manuscript traditions
All three maintain the same basic scene but show how different historical artistic traditions would approach the presentation and framing of the content. They appear to be digitally created to mimic these traditional styles rather than hand-drawn with traditional materials.
Interesting. You are certainly correct about #2. If nothing else, the letter gives it away. What you say about the other two is reasonable. But it’s not what I asked ChatGPT to do. In the first case I asked it to do a Japanese wood block print. So you got the medium right, but not the culture. In the third case, I asked for something in the style of a Mughal miniature. I’m not sure that it got it.
You're absolutely right, and I apologize for misidentifying the styles you requested from ChatGPT. Let me take another look at the images with this information:
Image 1: Now that you mention it, I can see how this was intended to be a Japanese woodblock print (ukiyo-e) style. While it captured some elements like the defined outlines and flat color areas, it's missing key characteristics of authentic ukiyo-e like:
- The distinctive perspective techniques
- The typical stylized clouds, waves, or atmospheric elements
- The characteristic line work and patterning
- The signature color palette and printing effects
Image 3: For a Mughal miniature, it's indeed missing several key elements that would make it authentic to that tradition, such as:
- The intricate detail and fine brushwork typical of Mughal art
- The distinctive figure styles and proportions
- The characteristic architectural detailing
- The hierarchical scaling and perspective
- The more elaborate and specifically Persian/Indian decorative motifs in the border
It seems ChatGPT created approximations of these styles while still keeping much of the medieval European aesthetic of the original drawing. The attempts show the general concept but don't fully capture the distinctive cultural and artistic elements that define authentic Japanese woodblock prints or Mughal miniatures.
Ah, your observation about stylized clouds and waves in one is right on. And I agree with your remarks about #3. I was expecting more intricacy.
Your expectations are completely justified! For true Japanese woodblock prints, those stylized clouds, distinctive waves (like Hokusai's famous wave), and the particular approach to perspective would be essential elements. The attempt misses those culturally specific artistic techniques that make ukiyo-e instantly recognizable.
And yes, authentic Mughal miniatures are known for their extraordinary intricacy and detail - painters often used single-hair brushes for the most delicate elements. The floral border in image #3 is simplified compared to the incredibly detailed geometric and floral patterns you'd see in genuine Mughal art, where borders could be as meticulously crafted as the central image itself.
It's interesting to see how AI interprets these style requests while maintaining the core scene. It seems to grasp some basic elements but struggles with the deeper cultural and technical aspects that give these traditional art forms their distinctive character.
But, you know, it got the sky and clouds in #1, and there wasn’t any ocean, so no waves.
Yes, you make a good point! The image #1 does capture some elements of the Japanese woodblock print style in the clouds and sky treatment. Without an ocean scene, it couldn't include the iconic wave patterns we often associate with ukiyo-e prints.
The attempt does have some success with the layered mountains and the simplified, somewhat flattened perspective that's characteristic of Japanese landscape prints. I think what makes it still read as not fully authentic is the handling of the line work and color application, which doesn't quite capture the distinctive printing techniques and color palette that give ukiyo-e prints their unique character.
It's interesting to consider the challenge of translating different artistic styles across cultural contexts while maintaining the same subject matter. Some elements adapt more easily than others.
Yes, the cultural translation task is an interesting one.
The cultural translation challenge is fascinating indeed. It highlights how art is deeply embedded in cultural contexts, with specific techniques, materials, and aesthetic principles that evolved within particular traditions.
When AI attempts to translate across these boundaries, it reveals how much artistic styles are more than just visual appearances - they're entire systems of seeing and representing the world that reflect cultural values, historical developments, and technical innovations specific to a time and place.
What I find particularly interesting is how certain elements translate more easily than others - like compositional aspects versus the more technically specific elements like brushwork or printing techniques that would require physical understanding of the medium to truly capture.
It's a good reminder of how rich and complex artistic traditions really are, beyond their surface appearances.



