We find that most of the information for linguistic features (e.g., grammatical gender, number, and animacy) is located in subsets of 5 to 10 neurons. We are able to find these subsets with a novel generative probe based on a Gaussian assumption.
— Lucas Torroba Hennigen (@ltorroba1) October 20, 2020
Another exciting bit of related research (coming to a conference near you!) is that mBERT encodes linguistic features using the same neurons across distinct languages. For instance, Hindi and Latvian use overlapping neurons for number and gender. This was quite unexpected!
— Lucas Torroba Hennigen (@ltorroba1) October 20, 2020
Experiments were conducted on 36 languages from the Universal Dependencies corpus. Joint work with @adinamwilliams and @ryandcotterell
— Lucas Torroba Hennigen (@ltorroba1) October 20, 2020
Abstract for the linked article:
Most modern NLP systems make use of pretrained contextual representations that attain astonishingly high performance on a variety of tasks. Such high performance should not be possible unless some form of linguistic structure inheres in these representations, and a wealth of research has sprung up on probing for it. In this paper, we draw a distinction between intrinsic probing, which examines how linguistic information is structured within a representation, and the extrinsic probing popular in prior work, which only argues for the presence of such information by showing that it can be successfully extracted. To enable intrinsic probing, we propose a novel framework based on a decomposable multivariate Gaussian probe that allows us to determine whether the linguistic information in word embeddings is dispersed or focal. We then probe fastText and BERT for various morphosyntactic attributes across 36 languages. We find that most attributes are reliably encoded by only a few neurons, with fastText concentrating its linguistic structure more than BERT.
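To make the idea of a class-conditional Gaussian probe over a small neuron subset concrete, here is a minimal sketch in Python. It is not the authors' implementation: it fits one multivariate Gaussian per attribute value (e.g., each grammatical gender) over a chosen subset of embedding dimensions, scores held-out embeddings by the class posterior, and grows the subset greedily by dev-set accuracy. The greedy selection criterion, the helper names (`fit_gaussian_probe`, `greedy_select`), and the synthetic data are assumptions made for illustration.

```python
# Sketch of a class-conditional Gaussian probe over a subset of neurons.
# Assumption-laden illustration, not the paper's exact decomposable probe.
import numpy as np
from scipy.stats import multivariate_normal


def fit_gaussian_probe(X, y, dims):
    """Fit one Gaussian (mean, covariance, prior) per class over the neuron subset `dims`."""
    probe = {}
    Xs = X[:, dims]
    for c in np.unique(y):
        Xc = Xs[y == c]
        mean = Xc.mean(axis=0)
        # Small ridge on the covariance keeps it invertible for tiny subsets.
        cov = np.cov(Xc, rowvar=False) + 1e-6 * np.eye(len(dims))
        probe[c] = (mean, cov, len(Xc) / len(X))
    return probe


def probe_accuracy(probe, X, y, dims):
    """Classify each embedding by the highest class posterior (Bayes rule) and return accuracy."""
    Xs = X[:, dims]
    classes = list(probe)
    scores = np.stack(
        [multivariate_normal.logpdf(Xs, mean=m, cov=c) + np.log(p)
         for m, c, p in (probe[k] for k in classes)],
        axis=1,
    )
    preds = np.array(classes)[scores.argmax(axis=1)]
    return (preds == y).mean()


def greedy_select(X_train, y_train, X_dev, y_dev, k=5):
    """Greedily grow a neuron subset that best predicts the attribute on the dev split."""
    selected, remaining = [], list(range(X_train.shape[1]))
    for _ in range(k):
        best_dim, best_acc = None, -1.0
        for d in remaining:
            dims = selected + [d]
            probe = fit_gaussian_probe(X_train, y_train, dims)
            acc = probe_accuracy(probe, X_dev, y_dev, dims)
            if acc > best_acc:
                best_dim, best_acc = d, acc
        selected.append(best_dim)
        remaining.remove(best_dim)
    return selected


if __name__ == "__main__":
    # Toy data: 768-dim "embeddings" in which only two dimensions carry the attribute.
    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, size=2000)        # binary attribute, e.g. gender
    X = rng.normal(size=(2000, 768))
    X[:, 5] += 3.0 * y                       # informative dimensions
    X[:, 17] -= 2.0 * y
    split = 1500
    dims = greedy_select(X[:split], y[:split], X[split:], y[split:], k=3)
    print("selected neurons:", dims)
```

On the toy data above, the greedy search should recover the two planted dimensions first, mirroring the paper's finding that a handful of neurons can reliably encode a morphosyntactic attribute.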