Max Little at Language Log, Adversarial attacks on modern speech-to-text:
Accumulated evidence over the last few years shows that empirically, methods (such as the Mozilla DeepSpeech architecture in the example above) based on deep learning and massive amounts of labelled data ("thousands of hours of labeled audio"), seem to outperform earlier methods, when compared on the test set. In fact, the performance on the test set can be quite extraordinary, e.g. word error rates (WERs) as low as 6.5% (which comes close to human performance, around 5.8% on their training dataset). These algorithms often have 100's of millions of parameters (Mozilla DeepSpeech has 120 million) and model training can take hundreds of hours on massive amounts of hardware (usually GPUs, which have dedicated arrays of co-processors). So, these algorithms are clearly exquisitely tuned to the STT task for the particular distribution of the given dataset.
The crucial weakness here, what the adversarial attack exploits, is their manifest success, i.e. very low WER on the given dataset distribution. But because they are so effective at this task, they have what might best be described as huge "blind spots". Adversarial attacks work by learning how to change the input in tiny steps such as to force the algorithm into any desired classification output. This turns out to be surprisingly easy and has been demonstrated to work for just about every kind of deep learning classifier.
Current machine learning systems, even sophisticated deep learning methods, are only able to solve the problem they are set up to solve, and that can be a very specific problem. This may seem obvious but the hyperbole that accompanies any deep learning application (coupled with clear lack of analytical understanding how these algorithms actually work) often provokes a lot of what might best be described as "magical thinking" about their extraordinary powers as measured by some single error metric.
So, the basic fact is that if they are set up to map sequences of spectrogram feature vectors to sequences of phoneme labels in such a way as to minimize the WER on that dataset distribution, then that is the only task they can do. It is important not to fall into the magical thinking trap about modern deep learning-based systems. Clearly, these algorithms have not somehow "learned" language in the way that humans understand it. They have no "intelligence" that we would recognize.
That is to say, these systems are as brittle in their own way as old-school symbolic AI systems are. Thus, Little "would not recommend them for scientific or legal annotation applications".
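To make the "tiny steps" idea concrete, here is a minimal sketch of a targeted attack on a toy classifier. Everything in it is an assumption made for illustration: the three-class linear softmax model, its weights, the input vector, and the step size and budget are all hypothetical, and real attacks on speech-to-text perturb audio fed to a deep network rather than four hand-picked features. The mechanism is the same in spirit, though: follow the gradient of the desired output with respect to the input, a little at a time, while keeping the total change small.

```python
import math

# Hypothetical toy model: 3 classes, 4 input features, fixed weights.
WEIGHTS = [
    [1.0, -0.5, 0.3, 0.2],
    [-0.4, 0.8, -0.2, 0.1],
    [0.1, 0.2, 0.9, -0.7],
]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(x):
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in WEIGHTS]
    return softmax(scores)


def predict(x):
    probs = predict_proba(x)
    return probs.index(max(probs))


def targeted_attack(x, target, step=0.01, budget=0.3, max_iters=200):
    """Nudge x in small signed-gradient steps until the model outputs `target`."""
    adv = list(x)
    for _ in range(max_iters):
        if predict(adv) == target:
            break
        probs = predict_proba(adv)
        # Gradient of log p(target) with respect to each input feature.
        grad = [
            WEIGHTS[target][j]
            - sum(probs[k] * WEIGHTS[k][j] for k in range(len(WEIGHTS)))
            for j in range(len(adv))
        ]
        for j, g in enumerate(grad):
            sign = (g > 0) - (g < 0)
            moved = adv[j] + step * sign
            # Keep every feature within `budget` of its original value.
            adv[j] = min(max(moved, x[j] - budget), x[j] + budget)
    return adv


original = [1.0, 0.2, 0.1, 0.0]
adversarial = targeted_attack(original, target=2)
max_change = max(abs(a - o) for a, o in zip(adversarial, original))
print("original class:", predict(original))
print("adversarial class:", predict(adversarial))
print("largest per-feature change:", round(max_change, 3))
```

In this sketch the model assigns the original input to class 0, and the attack pushes it to class 2 without moving any feature by more than the 0.3 budget. The model itself is never modified; only the input changes, which is why a low error rate on the original data distribution says little about how the system behaves on inputs chosen by an adversary.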