Audio Transcription Support

The DialoX platform provides robust support for transcribing audio files through its Transcribe API. This allows you to convert spoken content into text that can be processed by your bots. It functions as a gateway into various underlying transcription providers, providing a unified interface for creating transcripts.

The transcription features can be used both through the Transcribe API and through the built-in transcribe() function.

Supported Features

  • Multiple audio file formats supported (WAV, MP3, Opus, etc.)
  • Language detection or explicit language selection
  • Multi-channel audio support (e.g. for call recordings), where each audio channel is attributed to a different speaker

Using the Transcribe API

To transcribe an audio file, use the audio-transcribe endpoint.

This endpoint creates a new conversation in the given bot, and then transcribes the audio file. The transcript arrives as individual messages in the conversation, as well as a $text_transcript event which contains additional metadata about the transcription.

See the API documentation for more information.

Locale detection

To produce a correct transcription, the language of the audio file needs to be known. Some transcription providers can detect the language themselves, while others require it to be specified explicitly.

The locale parameter of the transcription request can be set to a language code (e.g. en, fr, de) to force transcription in that language, or to one of the special values detect or auto.

  • detect: The DialoX platform will attempt to detect the language of the audio file based on the first 30 seconds of the file. Under the hood, this uses an LLM (Google / Gemini-1.5-flash) to detect the language, but this can be overridden by creating a prompt in your bot called detect_audio_language.
  • auto: Some providers have this detection built-in. If auto is specified, the platform will leave the detection to the provider.
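
For illustration, the three kinds of locale values, as they might appear in a transcription request or in a transcribe_config override (described further below):

locale: detect # platform-side detection based on the first 30 seconds
# locale: auto # provider-side detection, where supported (see the table below)
# locale: en   # force a specific language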

Provider support for the auto and detect values of the locale parameter is as follows:

| Speech Provider  | Supports     | Notes                                                                          |
| ---------------- | ------------ | ------------------------------------------------------------------------------ |
| Google           | auto, detect | detect is supported on all models; auto is only supported on the Chirp models. |
| Google AI        | auto         | Locale parameter is ignored; Gemini models always detect the language themselves. |
| Microsoft OpenAI | auto         | Locale parameter is ignored; the Whisper model always detects the language itself. |
| Speechmatics     | auto, detect | Both are supported, but auto may fail for some inputs. detect is the default.  |
| Deepgram         | auto, detect | Both are supported; auto is the default.                                       |
| Juvoly           | -            | Locale parameter is ignored; Juvoly models always detect the language themselves. |

Provider-specific transcription parameters ("extra")

The extra parameter is a metadata object that is passed along with the transcription API request. It can contain arbitrary information; however, when it contains a key matching the name of a provider, the contents of that key are treated as provider-specific parameters and are passed on to that provider.
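
For example, a minimal sketch of this structure; case_id is a hypothetical arbitrary metadata key, while the google key holds the provider-specific field documented below:

extra:
  case_id: "12345" # arbitrary metadata (hypothetical key)
  google:
    processing_strategy: DYNAMIC_BATCHING # provider-specific, passed on to the Google provider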

Google

For the Google (google) provider, the extra parameter can contain a google object with the following fields:

  • processing_strategy: Set to DYNAMIC_BATCHING to enable lower-cost, higher-latency transcription for longer audio files. See Google's documentation for details.

Example:

extra:
  google:
    processing_strategy: DYNAMIC_BATCHING

Speechmatics

For Speechmatics, the entire extra.speechmatics object is merged with the transcription request in the create transcription job API call. For example, to override the transcript_config.domain parameter, which can be used to specify the domain of the audio file:

extra:
  speechmatics:
    transcript_config:
      domain: finance

Google AI

For the Google AI (google_ai) provider, the extra parameter can contain a google_ai object with the following fields:

  • prompt: The name of the prompt to use for the transcription.
  • bindings: A dictionary of bindings to use for the prompt.

For example:

extra:
  google_ai:
    prompt:
      name: custom_audio_transcribe
    bindings:
      my_variable: "Value of my variable"

Deepgram

For the Deepgram (deepgram) provider, the entire extra.deepgram object is merged with the transcription request in the create transcription job API call.
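
For example, a sketch passing two of Deepgram's own transcription options (smart_format and diarize come from Deepgram's API; consult Deepgram's documentation for the options available to your model):

extra:
  deepgram:
    smart_format: true # Deepgram option: apply smart formatting to the transcript
    diarize: true # Deepgram option: label speakers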

Microsoft OpenAI (whisper)

For the Microsoft OpenAI (whisper) provider, the extra parameter can contain a whisper_prompt field, which is a string that is passed to the transcription API request. Read the Whisper prompting guide for more information.

For example:

extra:
  whisper_prompt: "Hi there and welcome to the show."

Overriding transcription API settings in a bot

The transcription API settings can be overridden in a bot by creating a transcribe_config file in the bot's root directory.

This file should be a YAML file with the following structure:

override:
  provider: speechmatics
  locale: detect

The parameters that can be overridden are the following:

  • provider: The transcription provider to use. See the API docs for the available providers.
  • model: The provider-specific transcription model to use.
  • locale: The language to transcribe the audio in, or detect, or auto.
  • extra: Extra parameters to pass to the transcription API request. See the provider-specific parameters described above.
  • no_speech_prob_cutoff: The confidence cutoff used when converting transcriptions to messages. The cutoff is a float between 0 and 1, where 0 is the most strict (only words with 100% confidence are kept) and 1 is the most lenient (all words are kept). It defaults to 0.5.

See the API documentation for more information.
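
As a combined, purely illustrative example, using only values that appear elsewhere in this document:

override:
  provider: google
  model: chirp_2
  locale: auto
  no_speech_prob_cutoff: 0.5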

Transcription variants

By specifying a variants object in the transcribe_config file, you can define multiple transcription variants for a single audio file. The variants object maps a name for each variant to an object with the same structure as the override object.

This is useful if you want to transcribe the same audio file in parallel using different providers or models, to evaluate the quality of the transcription.

variants:
  speechmatics:
    provider: speechmatics
    locale: detect
  google_chirp:
    provider: google
    model: chirp_2
    locale: auto

Conditional Transcription Settings

The transcribe_config supports conditional overrides based on audio file characteristics. This allows selecting different transcription configurations depending on the properties of the audio file.

Duration-based Selection

You can specify different transcription configurations based on the duration of the audio file. This is useful for optimizing transcription quality and cost - for example, using a faster but less accurate model for short files, and a more accurate but slower/more expensive model for longer files.

Example configuration:

conditional_overrides:
  - duration:
      lt: 60 # Less than 60 seconds
    override:
      provider: microsoft_openai
      model: whisper
  - duration:
      gt: 120 # Greater than 120 seconds
    override:
      provider: speechmatics
      extra:
        speechmatics:
          transcript_config:
            domain: finance

# Default override (used when no conditions match)
override:
  provider: google
  model: chirp_2

Available Duration Conditions

  • lt: Less than (duration < value)
  • lte: Less than or equal to (duration <= value)
  • gt: Greater than (duration > value)
  • gte: Greater than or equal to (duration >= value)
  • eq: Equal to (duration == value)

Duration is specified in seconds. If more than one operator is given in the duration map, they are combined with AND.

Example with multiple operators:

conditional_overrides:
  - duration:
      gte: 60 # Audio file needs to be greater than or equal to 60 seconds
      lt: 120 # AND less than 120 seconds
    override:
      provider: deepgram

Conditionally discarding transcriptions

In some cases, you may want to skip ("discard") transcription entirely for certain audio files, such as those that are too short or do not meet your criteria. The discard option can be used in a conditional override to accomplish this. When a condition matches and the override contains a discard field, the transcription job will be skipped and the provided reason will be logged.

Example configuration:

conditional_overrides:
  - duration:
      lt: 3 # Less than 3 seconds
    override:
      discard: "Audio file too short"

In this example, if the audio file is less than 3 seconds, the transcription will be discarded and the reason "Audio file too short" will be logged. No transcription request will be sent to any provider for that audio file.

This is useful for filtering out audio that is too short, silent, or otherwise not worth transcribing, saving resources and providing clear auditability for skipped files.