Tuesday, October 4, 2022

Linearly Mapping from Image to Text Space

The paper's conclusion:

In this paper, we test the extent to which the representations of language models encode information about the non-linguistic world in terms of their ability to use image representations to perform vision-language tasks. We show through LiMBeR (Linearly Mapping Between Representation spaces) that training a linear (thus, distance-preserving) transformation to connect image features to an LM’s input space is competitive on image captioning and visual question answering benchmarks with similar models like MAGMA that tune both image and text networks. However, we also find that such transfer is highly dependent on the amount of linguistic supervision the image encoder backbone had during its pretraining phase. BEIT, which is a vision-only image encoder, underperforms compared to CLIP, which was pretrained with natural language captions. We explore what conceptual information transfers successfully, and find through probing, clustering, and analysis of generated text that the representational similarity between LMs and vision-only image representations is mostly restricted to coarse-grained concepts of perceptual features. Our findings indicate that large LMs do appear to form models of the visual world along these perceptual concepts to some extent, but are biased to form categorical concepts of words that are not distinguished by vision-only models. We are excited by future work applying LiMBeR to other domains and modalities as a behavioral tool for understanding the representations of LMs and other deep neural networks.
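The core idea is simple enough to sketch in a few lines: a frozen image encoder produces a feature vector, a single trainable linear map projects that vector into the LM's input embedding space as a handful of soft "visual tokens", and the (also frozen) LM conditions on those tokens to generate text. The sketch below is not the authors' code; the class name, dimensions, and number of visual tokens are illustrative assumptions.

import torch
import torch.nn as nn

class LinearImagePrefix(nn.Module):
    """A single linear projection from image-feature space to LM embedding space."""

    def __init__(self, image_dim: int = 1024, lm_dim: int = 4096, n_tokens: int = 4):
        super().__init__()
        self.n_tokens = n_tokens
        self.lm_dim = lm_dim
        # The only trainable component: maps one pooled image feature vector
        # to n_tokens vectors in the LM's input embedding space.
        self.proj = nn.Linear(image_dim, n_tokens * lm_dim)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, image_dim), e.g. pooled CLIP or BEIT features
        prefix = self.proj(image_features)                  # (batch, n_tokens * lm_dim)
        return prefix.view(-1, self.n_tokens, self.lm_dim)  # (batch, n_tokens, lm_dim)

prefix_net = LinearImagePrefix()
fake_image_features = torch.randn(2, 1024)   # stand-in for real encoder output
visual_tokens = prefix_net(fake_image_features)  # shape: (2, 4, 4096)

In training, these projected vectors would be concatenated in front of the caption's token embeddings and the frozen LM trained with the usual next-token loss, with gradients flowing only into the linear map.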
