Audio Transcription Support¶
The DialoX platform provides robust support for transcribing audio files through its Transcribe API, which allows you to convert spoken content into text that can be processed by your bots. The API acts as a gateway to various underlying transcription providers, offering a unified interface for creating transcripts.
The transcription features can be used both through the Transcribe API and through the builtin `transcribe()` function.
Supported Features¶
- Multiple audio file formats supported (WAV, MP3, Opus, etc.)
- Language detection or explicit language selection
- Multi-channel audio support (e.g. for call recordings), where each audio channel is attributed to a different speaker
Using the Transcribe API¶
To transcribe an audio file, use the `audio-transcribe` endpoint.
This endpoint creates a new conversation in the given bot and then transcribes the audio file. The transcript arrives as individual messages in the conversation, as well as a `$text_transcript` event which contains additional metadata about the transcription.
See the API documentation for more information.
Locale detection¶
To create a correct transcription, the language of the audio file must be known. Some transcription providers can detect the language themselves, while others require it to be specified explicitly.
The `locale` parameter of the transcription request can be set either to a language code (e.g. `en`, `fr`, `de`) to force the transcription to be in that language, or to `detect` or `auto`.
- `detect`: The DialoX platform will attempt to detect the language of the audio file based on the first 30 seconds of the file. Under the hood, this uses an LLM (Google / Gemini-1.5-flash) to detect the language, but this can be overridden by creating a prompt in your bot called `detect_audio_language`.
- `auto`: Some providers have this detection built-in. If `auto` is specified, the platform will leave the detection to the provider.
The possibilities for the `auto`/`detect` values of the `locale` parameter are the following:

| Speech Provider | Supports | Notes |
|---|---|---|
| Google | `auto`, `detect` | `detect` is supported on all models; `auto` is only supported on the Chirp models. |
| Google AI | `auto` | Locale parameter is ignored; Gemini models always detect the language themselves. |
| Microsoft OpenAI | `auto` | Locale parameter is ignored; the Whisper model always detects the language itself. |
| Speechmatics | `auto`, `detect` | Both are supported, but `auto` may fail for some inputs. `detect` is the default. |
| Deepgram | `auto`, `detect` | Both are supported; `auto` is the default. |
| Juvoly | - | Locale parameter is ignored; Juvoly models always detect the language themselves. |
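For example (a sketch using the `transcribe_config` override format described later on this page), you could pair the Google Chirp model with its built-in language detection:

override:
  provider: google
  model: chirp_2
  locale: auto # Chirp supports auto; with other models use detect or a fixed code such as fr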
Provider-specific transcription parameters ("extra")¶
The `extra` parameter is a metadata object that is passed to the transcription API request. It can contain any information; however, when it contains a key with the name of a provider, those values are considered provider-specific parameters and are passed to the transcription API request.
Google¶
For the Google (`google`) provider, the `extra` parameter can contain a `google` object with the following fields:
- `processing_strategy`: Set to `DYNAMIC_BATCHING` to enable lower-cost, higher-latency transcription for longer audio files. See Google's documentation for details.
Example:
extra:
  google:
    processing_strategy: DYNAMIC_BATCHING
Speechmatics¶
For the Speechmatics (`speechmatics`) provider, the entire `extra.speechmatics` object is merged with the transcription request in the create transcription job API call. For example, to override the `transcript_config.domain` parameter, which can be used to specify the domain of the audio file:
extra:
  speechmatics:
    transcript_config:
      domain: finance
Google AI¶
For the Google AI (`google_ai`) provider, the `extra` parameter can contain a `google_ai` object with the following fields:
- `prompt`: The name of the prompt to use for the transcription.
- `bindings`: A dictionary of bindings to use for the prompt.
For example:
extra:
google_ai:
prompt:
name: custom_audio_transcribe
bindings:
my_variable: "Value of my variable"
Deepgram¶
For the Deepgram (`deepgram`) provider, the entire `extra.deepgram` object is merged with the transcription request in the create transcription job API call.
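For example, Deepgram's own transcription options can be placed under `extra.deepgram` and are forwarded as-is. A minimal sketch, assuming the Deepgram options `smart_format` and `diarize` (these option names come from Deepgram's API, not from this documentation):

extra:
  deepgram:
    smart_format: true # Deepgram option: apply smart formatting to numbers, dates, etc.
    diarize: true # Deepgram option: attribute words to individual speakers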
Microsoft OpenAI (`whisper`)¶
For the Microsoft OpenAI (`whisper`) provider, the `extra` parameter can contain a `whisper_prompt` field, which is a string that is passed to the transcription API request. Read the Whisper prompting guide for more information.
For example:
extra:
  whisper_prompt: "Hi there and welcome to the show."
Overriding transcription API settings in a bot¶
The transcription API settings can be overridden in a bot by creating a `transcribe_config` file in the bot's root directory.
This file should be a YAML file with the following fields:
override:
  provider: speechmatics
  locale: detect
The parameters that can be overridden are the following:
- `provider`: The transcription provider to use. See the API docs for the available providers.
- `model`: The provider-specific transcription model to use.
- `locale`: The language to transcribe the audio in, or `detect`, or `auto`.
- `extra`: Extra parameters to pass to the transcription API request. See the provider-specific parameters section above.
- `no_speech_prob_cutoff`: The cutoff to consider when converting transcriptions to messages. The cutoff is a float between 0 and 1, where 0 is the most strict (only words with 100% confidence are considered) and 1 is the most lenient (all words are considered). It defaults to 0.5.
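For example, a fuller `transcribe_config` combining several of these fields could look as follows (the chosen provider, model and cutoff value are illustrative, not recommendations):

override:
  provider: google # transcription provider
  model: chirp_2 # provider-specific model
  locale: detect # let the DialoX platform detect the language
  no_speech_prob_cutoff: 0.3 # stricter than the 0.5 default (illustrative value)
  extra:
    google:
      processing_strategy: DYNAMIC_BATCHING # lower-cost, higher-latency batch processing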
See the API documentation for more information.
Transcription variants¶
By specifying a `variants` object in the `transcribe_config` file, you can specify multiple transcription variants for a single audio file. The `variants` object is a map in which each key is a name for the variant and each value is an object just like the `override` object.
This is useful if you want to transcribe the same audio file in parallel using different providers or models, to evaluate the quality of the transcription.
variants:
  speechmatics:
    provider: speechmatics
    locale: detect
  google_chirp:
    provider: google
    model: chirp_2
    locale: auto
Conditional Transcription Settings¶
The `transcribe_config` file supports conditional overrides based on audio file characteristics. This allows selecting different transcription configurations depending on the properties of the audio file.
Duration-based Selection¶
You can specify different transcription configurations based on the duration of the audio file. This is useful for optimizing transcription quality and cost: for example, using a faster but less accurate model for short files, and a more accurate but slower or more expensive model for longer files.
Example configuration:
conditional_overrides:
  - duration:
      lt: 60 # Less than 60 seconds
    override:
      provider: microsoft_openai
      model: whisper
  - duration:
      gt: 120 # Greater than 120 seconds
    override:
      provider: speechmatics
      extra:
        speechmatics:
          transcript_config:
            domain: finance

# Default override (used when no conditions match)
override:
  provider: google
  model: chirp_2
Available Duration Conditions¶
- `lt`: Less than (duration < value)
- `lte`: Less than or equal to (duration <= value)
- `gt`: Greater than (duration > value)
- `gte`: Greater than or equal to (duration >= value)
- `eq`: Equal to (duration == value)
Duration is specified in seconds. If more than one operator is given in the `duration` map, they are combined with AND.
Example with multiple operators:
conditional_overrides:
  - duration:
      gte: 60 # Audio file needs to be greater than or equal to 60 seconds
      lt: 120 # AND less than 120 seconds
    override:
      provider: deepgram
Conditionally discarding transcriptions¶
In some cases, you may want to skip ("discard") transcription entirely for certain audio files, such as those that are too short or do not meet your criteria. The `discard` option can be used in a conditional override to accomplish this. When a condition matches and the override contains a `discard` field, the transcription job will be skipped and the provided reason will be logged.
Example configuration:
conditional_overrides:
  - duration:
      lt: 3 # Less than 3 seconds
    override:
      discard: "Audio file too short"
In this example, if the audio file is less than 3 seconds, the transcription will be discarded and the reason "Audio file too short" will be logged. No transcription request will be sent to any provider for that audio file.
This is useful for filtering out audio that is too short, silent, or otherwise not worth transcribing, saving resources and providing clear auditability for skipped files.