In its annual State of ASR report, 3Play Media finds that automatic speech recognition technology has advanced, but human intervention is still required for accurate captioning use cases.
3Play Media, the leading media accessibility provider, has released its annual report on the state of Automatic Speech Recognition (ASR). The study examines the overall state of speech-to-text technology and assesses the performance of 9 leading speech recognition engines in captioning and transcription. According to the study, the accuracy of the technology has improved significantly since the company’s last report, published in January 2021.
3Play Media tested all 9 engines using a large dataset representative of 3Play Media’s diverse customer base. Accuracy was assessed against two measures: word error rate (WER) and format error rate (FER), which includes errors in formatting such as grammar, speaker identification and unvoiced elements in addition to word errors.
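For readers unfamiliar with the first metric, the sketch below shows one common way to compute word error rate: the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and the ASR output, divided by the number of words in the reference. It is illustrative only and is not 3Play Media’s methodology. The function name `word_error_rate`, the lowercasing and the whitespace tokenization are assumptions made for this example, and FER is not covered because the report’s formatting rules are not described here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption for this sketch: text is lowercased and split on
    whitespace. Real evaluations usually apply more careful
    normalization (punctuation, numerals, and so on).
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    hypothesis = "the cat sat on a mat"
    print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

If accuracy is taken as 1 minus WER, which is a common convention but an assumption here, a 99% accuracy target corresponds to no more than one word error per 100 reference words.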
On both the WER and FER measures, Speechmatics with 3Play mappings and post-processing led the pack, followed by Speechmatics alone and Microsoft. Rev, Google VM and Voicegain followed, each with respectable scores close enough that these providers are hard to tell apart. Despite exciting improvements across the board, all engines performed well below the industry standard of 99% accuracy, confirming that ASR alone is still not “good enough” to comply with legal requirements for closed captioning.
“As the AI models driving ASR continue to evolve, many of the engines we have evaluated have shown significant progress in their transcription accuracy over the past two years,” said Chris Antunes, co-CEO and co-founder of 3Play Media. “We publish this report every year because we use ASR in our own transcription process, and we have a vested interest in ensuring that we are using the best engine on the market. Speechmatics remains an undisputed industry leader in pre-recorded and live automated transcription, and the application of 3Play’s mappings and post-processing has resulted in an exciting improvement in word error rate of over 8%.”
The study showed a wide range of accuracy among the technologies tested, with the best and worst performing engines differing by more than 15 percentage points. This suggests that different engines are optimized for different purposes and that some ASR engines will not work well for transcription. Compared to other uses of speech-to-text technology, such as automated assistants capable of training on a specific voice, transcription is a very difficult task, with variables such as diverse sentence structure and spontaneous speech, specialized terminology, and complex patterns involving several speakers, accents and background noise.
Accuracy is essential in captioning for a number of reasons, the most important being that people who are d/Deaf or hard of hearing rely on captions as an accommodation. Accurate captions also improve viewer engagement: studies show that captions improve watch time, brand recognition and comprehension. And, as customer experience has become a critical driver for businesses, so has digital accessibility litigation: in 2021 alone, 10 accessibility lawsuits were filed per day.