Mark Liberman at Language Log has a useful post on a piece that's been making the DH rounds, Arvind Narayanan, "Language necessarily contains human biases, and so will machines trained on language corpora", Freedom to Tinker 8/24/2016:
We show empirically that natural language necessarily contains human biases, and the paradigm of training machine learning on language corpora means that AI will inevitably imbibe these biases as well.
This all started in the 1960s, with Gerard Salton and the "vector space model". The idea was to represent a document as a vector of word (or "term") counts, which, like any vector, represents a point in a multi-dimensional space. Then the similarity between two documents can be calculated by correlation-like methods, basically as some simple function of the inner product of the two term vectors. And natural-language queries are also a sort of document, though usually a rather short one, so you can use this general approach for document retrieval by looking for documents that are (vector-space) similar to the query. It helps if you weight the document vectors by inverse document frequency, and maybe use thesaurus-based term extension, and relevance feedback, and …

A vocabulary of 100,000 wordforms results in a 100,000-dimensional vector, but there's no conceptual problem with that, and sparse-vector coding techniques mean that there's no practical problem either. Except in the 1960s, digital "documents" were basically stacks of punched cards, and the market for digital document retrieval was therefore pretty small. Also, those were the days when people thought that artificial intelligence was applied logic; one of Marvin Minsky's students once told me that Minsky warned him "If you're counting higher than one, you're doing it wrong". Still, Salton's students (like Mike Lesk and Donna Harman) kept the flame alive.
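To make the vector space idea concrete, here is a minimal sketch in Python, using only the standard library. It is an illustration under stated assumptions, not Salton's system or anything from Liberman's post: the function names (`tokenize`, `idf_weights`, `vectorize`, `cosine`, `rank`), the whitespace tokenizer, the `log(N / df)` form of inverse document frequency, and the three toy documents are all made up for the example. Vectors are stored as sparse dictionaries, and similarity is the cosine, one of the "simple functions of the inner product" mentioned above. Thesaurus-based term extension and relevance feedback are left out.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def idf_weights(docs_tokens):
    """Inverse document frequency, here assumed to be log(N / df) per term."""
    n_docs = len(docs_tokens)
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def vectorize(tokens, idf):
    """Sparse term vector: term -> count * idf. Unknown terms are dropped."""
    counts = Counter(tokens)
    return {term: c * idf[term] for term, c in counts.items() if term in idf}


def cosine(u, v):
    """Cosine similarity: inner product divided by the two vector lengths."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Treat the query as a short document and score every document against it."""
    docs_tokens = [tokenize(d) for d in docs]
    idf = idf_weights(docs_tokens)
    doc_vectors = [vectorize(tokens, idf) for tokens in docs_tokens]
    query_vector = vectorize(tokenize(query), idf)
    scores = [(cosine(query_vector, dv), i) for i, dv in enumerate(doc_vectors)]
    return sorted(scores, reverse=True)  # best match first


# Toy collection, invented for this example.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock markets fell sharply today",
]

for score, index in rank("cat on a mat", docs):
    print(f"{score:.3f}  {docs[index]}")
```

Because each vector only stores the terms that actually occur, the size of the vocabulary never shows up in the cost of a comparison, which is the practical point behind the remark about sparse-vector coding.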
Mark goes on to discuss Google's "PageRank", "latent semantic analysis" (LSA), and more recent models. Liberman notes:
It didn't escape notice that this puts into effect the old idea of "distributional semantics", especially associated with Zellig Harris and John Firth, summarized in Firth's dictum that "you shall know a word by the company it keeps".
He then turns to Narayanan's post.