Following up on "Shelties On Alki Story Forest" (11/26/2019) and "The right boot of the warner of the baron" (12/6/2019), here's some recent testimony from engineers at Google about the brittleness of contemporary speech-to-text systems: Arun Narayanan et al., "Recognizing Long-Form Speech Using Streaming End-To-End Models", arXiv 10/24/2019.
The goal of that paper is to document some methods for making things better. But I want to underline the fact that considerable headroom remains, even with the massive amounts of training material and computational resources available to a company like Google.
Modern AI (almost) works because of machine learning techniques that find patterns in training data, rather than relying on human programming of explicit rules. A weakness of this approach has always been that generalization to material that differs in any way from the training set can be unpredictably poor. (Though of course rule- or constraint-based approaches to AI generally never got off the ground at all.) "End-to-end" techniques, which eliminate human-defined layers of representation like words, so that speech-to-text systems learn to map directly between sound waveforms and letter strings, are especially brittle.
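To make "directly between sound waveforms and letter strings" concrete, here's a toy sketch, in PyTorch, of the general end-to-end recipe: an acoustic encoder reads feature frames and a CTC loss aligns its per-frame character scores with the transcript, with no pronunciation dictionary or word-level layer in between. This is not the streaming RNN-T system that Narayanan et al. actually study; the model, layer sizes, and toy inputs below are all invented for illustration.

```python
# Toy "end-to-end" speech-to-text sketch: feature frames in, character
# probabilities out, aligned to the transcript with CTC loss. An invented
# illustration, not the streaming RNN-T model from the paper.
import torch
import torch.nn as nn

CHARS = "abcdefghijklmnopqrstuvwxyz' "          # output alphabet; CTC blank is index 0
NUM_CLASSES = len(CHARS) + 1                     # +1 for the blank symbol

class TinySpeechToText(nn.Module):
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        # acoustic encoder: sequences of log-mel frames -> hidden states
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=2, batch_first=True)
        # per-frame projection to character scores (no lexicon, no word layer)
        self.classifier = nn.Linear(hidden, NUM_CLASSES)

    def forward(self, feats):                    # feats: (batch, time, n_mels)
        states, _ = self.encoder(feats)
        return self.classifier(states).log_softmax(dim=-1)

# one fake "utterance": 200 random frames, transcript "hello world"
model = TinySpeechToText()
feats = torch.randn(1, 200, 80)
transcript = "hello world"
targets = torch.tensor([[CHARS.index(c) + 1 for c in transcript]])  # shift past blank

log_probs = model(feats).transpose(0, 1)         # CTCLoss wants (time, batch, classes)
loss = nn.CTCLoss(blank=0)(
    log_probs,
    targets,
    input_lengths=torch.tensor([feats.shape[1]]),
    target_lengths=torch.tensor([len(transcript)]),
)
loss.backward()                                  # gradients for one training step
print(f"toy CTC loss: {loss.item():.3f}")
```

A model like this learns whatever patterns its training set happens to contain, which is exactly why it can go unpredictably wrong on long-form or out-of-domain audio of the kind the paper documents.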
Read the whole thing; it's not very long.
FWIW, I note that the old rule-based systems (the ones that never got off the ground) were also brittle.