
Audio Transcription Support

The DialoX platform provides robust support for transcribing audio files through its Transcribe API, allowing you to convert spoken content into text that can be processed by your bots. The API acts as a gateway to various underlying transcription providers, offering a unified interface for creating transcripts.

The transcription features can be used both through the Transcribe API and through the built-in transcribe() function.

Supported Features

  • Multiple audio file formats (WAV, MP3, Opus, etc.)
  • Language detection or explicit language selection
  • Multi-channel audio support (e.g. for call recordings), where each audio channel is attributed to a different speaker

Using the Transcribe API

To transcribe an audio file, use the audio-transcribe endpoint.

This endpoint creates a new conversation in the given bot, and then transcribes the audio file. The transcript arrives as individual messages in the conversation, as well as a $text_transcript event which contains additional metadata about the transcription.
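A transcription request combines the parameters discussed in the rest of this page, such as provider, model, locale, and extra. The sketch below is illustrative only; it assumes the documented parameter names and leaves out the endpoint path, authentication, and how the audio file itself is supplied:

# Illustrative parameter set only; consult the API documentation
# for the actual request format and how the audio file is supplied.
provider: google
model: chirp_2
locale: auto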

See the API documentation for more information.

Locale detection

To produce a correct transcription, the language of the audio file must be known. Some transcription providers have this detection built in, while others require the language to be specified explicitly.

The locale parameter of the transcription request can be set to a language code (e.g. en, fr, de) to force the transcription to be in that language, or to one of the special values detect or auto:

  • detect: The DialoX platform will attempt to detect the language of the audio file based on the first 30 seconds of the file. Under the hood, this uses an LLM (Google / Gemini-1.5-flash) to detect the language, but this can be overridden by creating a prompt in your bot called detect_audio_language.
  • auto: Some providers have this detection built-in. If auto is specified, the platform will leave the detection to the provider.
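For example, a minimal locale setting for platform-side detection looks like this; the commented lines show the other accepted forms of the value:

# Platform-side language detection:
locale: detect
# Provider-side detection (where supported):
# locale: auto
# Explicit language:
# locale: fr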

Provider support for the auto and detect values of the locale parameter is as follows:

Speech Provider     Supports       Notes
Google              auto, detect   detect is supported on all models; auto is only supported on the Chirp models.
Google AI           auto           Locale parameter is ignored; Gemini models always detect the language themselves.
Microsoft OpenAI    auto           Locale parameter is ignored; the Whisper model always detects the language itself.
Speechmatics        auto, detect   Both are supported, but auto may fail for some inputs. detect is the default.
Deepgram            auto, detect   Both are supported; auto is the default.

Provider-specific transcription parameters ("extra")

The extra parameter is a metadata object that is attached to the transcription request. It can contain arbitrary information; however, any key named after a provider is treated as a set of provider-specific parameters and is passed on to that provider's transcription request.
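For illustration, the sketch below mixes free-form metadata with a provider-specific block. The ticket_id key is a made-up example of arbitrary metadata; the speechmatics key is forwarded to that provider, as detailed in the next section:

extra:
  # Free-form metadata; not interpreted by any provider (ticket_id is a hypothetical key)
  ticket_id: "12345"
  # Provider-specific parameters, forwarded to Speechmatics only
  speechmatics:
    transcript_config:
      domain: finance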

Speechmatics

For Speechmatics, the entire extra.speechmatics object is merged with the transcription request in the create transcription job API call. For example, to override the transcript_config.domain parameter, which can be used to specify the domain of the audio file:

extra:
  speechmatics:
    transcript_config:
      domain: finance

Google AI

For the Google AI (google_ai) provider, the extra parameter can contain a google_ai object with the following fields:

  • prompt: The name of the prompt to use for the transcription.
  • bindings: A dictionary of bindings to use for the prompt.

For example:

extra:
  google_ai:
    prompt:
      name: custom_audio_transcribe
    bindings:
      my_variable: "Value of my variable"

Deepgram

For the Deepgram (deepgram) provider, the entire extra.deepgram object is merged with the transcription request in the create transcription job API call.
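For example, to enable Deepgram's smart_format and punctuate options (assuming your Deepgram configuration accepts them; see Deepgram's documentation for the available request parameters):

extra:
  deepgram:
    smart_format: true
    punctuate: true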

Microsoft OpenAI (whisper)

For the Microsoft OpenAI (whisper) provider, the extra parameter can contain a whisper_prompt field: a string that is passed along with the transcription request. Read the Whisper prompting guide for more information.

For example:

extra:
  whisper_prompt: "Hi there and welcome to the show."

Overriding transcription API settings in a bot

The transcription API settings can be overridden in a bot by creating a transcribe_config file in the bot's root directory.

This file should be a YAML file with the following fields:

override:
  provider: speechmatics
  locale: detect

The parameters that can be overridden are the following:

  • provider: The transcription provider to use. See the API docs for the available providers.
  • model: The provider-specific transcription model to use.
  • locale: The language to transcribe the audio in, or detect, or auto.
  • extra: Extra parameters to pass to the transcription API request. See below.
  • no_speech_prob_cutoff: The cutoff to consider when converting transcriptions to messages. The cutoff is a float between 0 and 1, where 0 is the most strict (only words with 100% confidence are considered), and 1 is the most lenient (all words are considered). It defaults to 0.5.
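Putting these together, a transcribe_config that pins a provider and model, uses provider-side language detection, and tightens the cutoff could look like the sketch below. The values are illustrative only; in particular, model names such as nova-2 are provider-specific and not prescribed here:

override:
  provider: deepgram
  model: nova-2
  locale: auto
  no_speech_prob_cutoff: 0.3
  extra:
    deepgram:
      smart_format: true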

See the API documentation for more information.

Transcription variants

By specifying a variants object in the transcribe_config file, you can define multiple transcription variants for a single audio file. The variants object maps a name for each variant to an object with the same structure as the override object.

This is useful if you want to transcribe the same audio file in parallel using different providers or models, to evaluate the quality of the transcription.

variants:
  speechmatics:
    provider: speechmatics
    locale: detect
  google_chirp:
    provider: google
    model: chirp_2
    locale: auto