Speech recognition with AI

For readers in a hurry
- OpenAI's Whisper model has become a serious competitor to established speech recognition products such as Dragon NaturallySpeaking or Wolters Kluwer DictNow.
- Speech recognition is the basis of many AI-based automations, such as meeting summaries, documentation of doctor-patient conversations, voice search in company data or even specialized translation services.
- Whisper achieves very good results even with its base model (about 74 million parameters). Dictations can be summarized, conversations can be logged, and the agreed measures can be extracted automatically.
- With the language models now available, company-specific use cases can be implemented well. The direction is clear: away from tedious typing, towards targeted voice input and voice control of IT systems and machines.
Tip to try out
If you want to try this yourself, for example summarizing meetings from MS Teams, Zoom or Google Meet, analyzing online meetings by each participant's share of speaking time and extracting next steps, take a look at the providers fireflies.ai and read.ai. Both join a video conference as a silent participant, record what is said and use it to create a conversation analysis according to a predefined pattern.
The crux of the matter
In the automation of business processes, the idea generation phase is always about finding the salient point (Lat. punctum saliens). In other words: what proof must be provided so that an automation technology can be classified as effective and expedient for a specific use case?
In the case of AI-based automation, this is first and foremost the recognition rate (accuracy): does the artificial intelligence deliver a correct result? This question is so important because AI is based on stochastic processes that deliver the most probable result. Depending on how well the AI model has been trained, the most probable result is not necessarily the correct one.
In our speech use case, the crucial point is the correct transcription of the audio recording into continuous text. That is what we concentrate on here.
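One common way to put a number on the recognition rate of a transcription is the word error rate (WER): the number of word substitutions, insertions and deletions needed to turn the transcript into a reference text, divided by the number of words in the reference. The following is a minimal sketch under that assumption; the function name and the simple lowercase/whitespace normalization are illustrative choices, not part of Whisper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing in transcript
            insertion = curr[j - 1] + 1  # extra word in transcript
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "the meeting starts at nine"
    transcript = "the meeting start at nine"
    print(f"WER: {word_error_rate(reference, transcript):.2f}")
```

A WER of 0 means the transcript matches the reference word for word. Because inserted words count as errors too, the value can exceed 1 for very poor transcripts.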
Whisper - the whispering model
OpenAI provides an excellent speech recognition model with Whisper. We integrate it using a Python library. One peculiarity is that the model itself only processes audio in 30-second windows. A longer recording therefore has to be split into 30-second snippets, either by hand or by the library's transcribe() function, which slides a 30-second window over the file internally.
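The splitting step can be sketched in a few lines. This is a minimal illustration that assumes the recording has already been decoded into a flat sequence of mono samples at 16 kHz (the rate Whisper's audio loader resamples to); the function and constant names are hypothetical.

```python
SAMPLE_RATE = 16_000   # assumed: mono audio at 16 kHz
CHUNK_SECONDS = 30     # length of one snippet


def split_into_chunks(samples, sample_rate=SAMPLE_RATE, chunk_seconds=CHUNK_SECONDS):
    """Cut a sequence of samples into consecutive snippets of chunk_seconds.

    The last snippet may be shorter than the others.
    """
    chunk_len = sample_rate * chunk_seconds
    if chunk_len <= 0:
        raise ValueError("sample_rate and chunk_seconds must be positive")
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]


if __name__ == "__main__":
    recording = [0.0] * (SAMPLE_RATE * 70)  # 70 seconds of silence
    chunks = split_into_chunks(recording)
    print([len(c) / SAMPLE_RATE for c in chunks])
```

Cutting at fixed 30-second boundaries can split a word in the middle. A more careful implementation would cut at pauses or let the snippets overlap slightly.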
Whisper reads these snippets one after the other and converts each of them into a spectrogram (more precisely, a log-Mel spectrogram). You can see an example in the cover picture above.
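To make the term concrete: a spectrogram records, for each short time frame of the audio, how strongly each frequency is present. The sketch below computes a plain magnitude spectrogram with a naive Fourier transform from the standard library. It is only an illustration of the idea; Whisper additionally applies a Mel filter bank and log scaling, and the frame sizes used here are arbitrary assumptions chosen to keep the example fast.

```python
import cmath
import math


def magnitude_spectrogram(samples, frame_size=64, hop=32):
    """Return one list of frequency magnitudes per overlapping time frame."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # A Hann window softens the frame edges to reduce spectral leakage.
        windowed = [
            s * 0.5 * (1 - math.cos(2 * math.pi * n / (frame_size - 1)))
            for n, s in enumerate(frame)
        ]
        spectrum = []
        for k in range(frame_size // 2 + 1):  # non-negative frequencies only
            acc = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(windowed)
            )
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


if __name__ == "__main__":
    sample_rate = 6400
    tone = [math.sin(2 * math.pi * 800 * n / sample_rate) for n in range(256)]
    spec = magnitude_spectrogram(tone)
    loudest_bin = max(range(len(spec[0])), key=spec[0].__getitem__)
    print("loudest frequency:", loudest_bin * sample_rate / 64, "Hz")
```

Plotted as an image, with time on one axis and frequency on the other, this grid of magnitudes is the kind of picture the model works from.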
That's interesting! Whisper does not generate text directly from the audio file but takes a detour via a graphic artifact, the spectrogram. From this spectrogram, Whisper then uses pattern recognition not only to identify the spoken language (e.g. German or English) but also to decode, i.e. transcribe, the spoken text. This is where the Transformer approach that also underlies OpenAI's GPT models comes to the fore: turn patterns into numbers, and numbers into text, in this case.
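With the open-source openai-whisper Python package, these steps for a single snippet might look roughly as follows. This is a sketch modelled on the package's documented lower-level usage, not necessarily the setup used for this article; it assumes the package and ffmpeg are installed, and the wrapper function name and the file path in the usage comment are hypothetical.

```python
def detect_language_and_transcribe(path: str, model_name: str = "base"):
    """Transcribe the first 30 seconds of an audio file.

    Returns a (language_code, text) tuple.
    """
    # Imported lazily so the module loads even where the package is missing.
    import whisper  # assumed installed via: pip install openai-whisper

    model = whisper.load_model(model_name)

    # Load the audio and pad or trim it to exactly one 30-second window.
    audio = whisper.pad_or_trim(whisper.load_audio(path))

    # Detour via the spectrogram: audio -> log-Mel spectrogram.
    mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

    # Step 1: identify the spoken language from the spectrogram.
    _, probs = model.detect_language(mel)
    language = max(probs, key=probs.get)

    # Step 2: decode the spectrogram into text.
    # Half precision is only enabled on a GPU; on a CPU we stay with fp32.
    options = whisper.DecodingOptions(
        language=language,
        fp16=(model.device.type == "cuda"),
    )
    result = whisper.decode(model, mel, options)
    return language, result.text


# Usage (hypothetical file name):
#   language, text = detect_language_and_transcribe("snippet_001.wav")
```

The first call downloads the model weights, so it takes noticeably longer than later calls.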
This is done for every 30-second snippet. At the end, the transcribed texts are strung together and the transcription is complete.
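The final stitching step is simple enough to sketch without any speech library. In this illustration, transcribe_chunk is a hypothetical callable that turns one snippet into text; in practice it would wrap the speech model, here a stand-in is used so the example runs on its own.

```python
from typing import Callable, Iterable, TypeVar

Chunk = TypeVar("Chunk")


def transcribe_recording(
    chunks: Iterable[Chunk],
    transcribe_chunk: Callable[[Chunk], str],
) -> str:
    """Transcribe each snippet in order and join the texts into one transcript."""
    parts = []
    for chunk in chunks:
        text = transcribe_chunk(chunk).strip()
        if text:  # skip snippets that contain no speech
            parts.append(text)
    return " ".join(parts)


if __name__ == "__main__":
    # Stand-in for a real model: maps a snippet id to its "transcription".
    fake_results = {1: " Good morning. ", 2: "", 3: "Let us begin. "}
    print(transcribe_recording([1, 2, 3], fake_results.get))
```

Keeping the transcription function as a parameter makes it easy to swap in a different model or to test the stitching logic separately.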
Cui bono?
Where can Whisper be useful in the corporate environment? In addition to its high recognition accuracy, the simplicity of using the Whisper library is impressive. It therefore forms a welcome basis for specific applications of artificial intelligence to problems in the company. Here are some examples:
- Summaries of meetings and negotiations (meeting notes)
- Searching longer recordings for specific content discussed (semantic search)
- Voice control of downstream software (e.g. ERP system) and communication with machines (natural voice control)
- Spoken interaction with AI agents for the automated creation of data analyses, evaluations and dashboards (speaking instead of writing)
People who want to interact with computer systems using spoken language instead of typing will appreciate this possibility. Whisper not only paves the way for the analysis and utilization of spoken language as an object of interest; it also opens up a new interaction with enterprise applications and enterprise data, as we know it from consumer software such as Apple's Siri.
The spoken word counts again.
About Business Automatica GmbH:
Business Automatica reduces process costs by automating manual activities, increases the quality of data exchange in complex system architectures and connects on-premise systems with modern cloud and SaaS architectures. Applied artificial intelligence in the company is an integral part of this. Business Automatica also offers automation solutions from the cloud that are geared towards cyber security.