Evaluating and processing conversations

For readers in a hurry

  • Transcription converts spoken words into written text. This text can then be used in a variety of ways in the corporate context. We call this speech automation.
  • Summaries of conversations, video conferences or YouTube videos are the best-known use cases. However, AI can be used to create numerous other application-specific reports and trigger further automation.
  • The prerequisite is to clearly identify the individual speakers in the recording and to assign each passage of text to the right person, correctly and precisely. This procedure is called diarization (from "diary").
  • Diarization enables the speaker-specific interpretation of content and its use. It is the basis for automatically generated doctor's letters, lawyer-client conversations, order documentation in banking and insurance and much more.
  • In addition, follow-up processes can be triggered automatically if, for example, a supervisor approves a measure in a meeting, which then initiates and completes an approval process in the ERP system.

Tip to try out

If you use ChatGPT, take a look at OpenAI's new prompt guide. The maker of ChatGPT has published its own prompt-writing guide, which explains what a good, meaningful prompt should look like in ChatGPT - and via the API - so that the result is of the highest possible quality. It is worth noting that OpenAI generally writes very accessible documentation, so that even non-IT specialists can get the best out of ChatGPT, DALL-E and Whisper.

Actions require precision

If transcription is to go beyond pure speech recognition - the conversion of spoken words and sentences into text - then what has been said must be clearly attributed to individual speakers.

Video conferencing products such as Microsoft Teams, Zoom, Google Meet, GoToMeeting or Cisco WebEx can already identify each speaker and attribute their statements precisely, because every participant uses their own audio channel. This works reliably on the whole, apart from minor attribution errors when participants talk over one another.

If, for example, medical documentation is to be generated automatically from one or more doctor-patient consultations and imported into the hospital information system or practice management system, using the video conferencing systems mentioned above is often impractical. The doctor can help themselves by dictating the essential information into a smartphone during or after the appointment and having it transcribed automatically from there. Understandably, though, there is a desire to process the normal doctor-patient conversation directly, so that the patient can be given full attention.


AI-based transcription models such as OpenAI's Whisper can convert entire audio files into text files - and thus make them accessible for further processing - but they cannot identify the individual speakers. This leads to misinterpretations by downstream AI models if, for example, the patient's complaints are to be listed separately at the beginning of a hospital admission report.

Other AI models are therefore used to identify the speakers (e.g. doctor, patient, nurse, relative). These diarization models return a list of entries showing which speaker spoke from which second to which second.
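The output of a diarization model can be pictured as a plain list of timed speaker segments. The following sketch uses hypothetical data and anonymous labels (real models differ in their exact output format) together with a small helper that totals the speaking time per speaker:

```python
from collections import defaultdict

# Hypothetical diarization output: anonymous speaker labels with
# start/end times in seconds. Real models differ in exact format.
segments = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 7.4},
    {"speaker": "SPEAKER_01", "start": 7.9, "end": 21.2},
    {"speaker": "SPEAKER_00", "start": 21.6, "end": 30.1},
]

def speaking_time(segments):
    """Total speaking time per speaker label, in seconds."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)

print(speaking_time(segments))
```

At this stage the labels are still anonymous; mapping SPEAKER_00 to "doctor" and SPEAKER_01 to "patient" is a separate step that depends on the use case.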

This information is then combined with the output of the transcription model, so that who said what is available to the subsequent, likewise AI-based text analysis. This matters when differentiating content: the complaint comes from the patient, while the treatment suggestion comes from the doctor. Without this attribution - plain text carries no vocal differentiation - no computer can reliably assign what has been said, and misinterpretations would increasingly creep in, which must be avoided, especially in critical areas.
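The merge step itself can be sketched as a simple timestamp match: each transcribed segment is assigned the speaker whose diarization turn overlaps it the most. The data and role labels below are illustrative; real pipelines also have to handle overlapping speech and gaps.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the temporal overlap between two intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(transcript, diarization):
    """Attach to each transcript segment the speaker with maximal overlap."""
    labelled = []
    for seg in transcript:
        best = max(
            diarization,
            key=lambda turn: overlap(seg["start"], seg["end"],
                                     turn["start"], turn["end"]),
        )
        labelled.append({**seg, "speaker": best["speaker"]})
    return labelled

# Illustrative data: transcription segments (text + timestamps) and
# diarization turns (speaker + timestamps) from two separate models.
transcript = [
    {"start": 0.2, "end": 6.8, "text": "What brings you in today?"},
    {"start": 7.5, "end": 14.0, "text": "I've had chest pain since Monday."},
]
diarization = [
    {"speaker": "DOCTOR", "start": 0.0, "end": 7.0},
    {"speaker": "PATIENT", "start": 7.2, "end": 15.0},
]

for seg in assign_speakers(transcript, diarization):
    print(f'{seg["speaker"]}: {seg["text"]}')
```

The result is a speaker-attributed transcript, which is exactly what the downstream text analysis needs in order to tell the complaint apart from the treatment suggestion.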

Use cases

This combination of several AI models enables the automation of industry-specific use cases.

For example, doctor's letters and care reports can be produced automatically and delivered to the desired addressee. Lawyers and tax consultants can record the results of their consultations, and the next steps agreed with the client, in the digital file. Banks and insurance companies can not only track orders and customer interactions but also immediately initiate automated actions such as purchase or sales orders or the dispatch of a policy.

Customer service desks and helpdesks can record bookings with specific details provided by the customer during the call or activate or deactivate licenses for the call partner.

What all use cases have in common is that the artificial intelligence is able to interpret the meaning of the conversation and put it into context thanks to the assignment to the conversation partner. This allows further automation processes to be initiated in subsequent systems without explicit human action. Human communication is used to solve problems, while the implementation is carried out automatically thanks to AI.

Transcription with diarization opens up completely new possibilities for companies in every industry to automate their day-to-day business - increasing productivity, extending their competitive advantage and raising employee satisfaction by eliminating monotonous tasks.

In short: words are followed by deeds.

About Business Automatica GmbH:

Business Automatica reduces process costs by automating manual activities, increases the quality of data exchange in complex system architectures and connects on-premise systems with modern cloud and SaaS architectures. Applied artificial intelligence in the company is an integral part of this. Business Automatica also offers automation solutions from the cloud that are geared towards cyber security.