Audio Transcription Support¶
The DialoX platform provides robust support for transcribing audio files through its Transcribe API, allowing you to convert spoken content into text that can be processed by your bots. The API acts as a gateway to various underlying transcription providers, offering a unified interface for creating transcripts.
The transcription features can be used both through the Transcribe API and through the builtin `transcribe()` function.
Supported Features¶
- Support for multiple audio file formats (WAV, MP3, Opus, etc.)
- Language detection or explicit language selection
- Multi-channel audio support (e.g. for call recordings), where each audio channel is attributed to a different speaker
Using the Transcribe API¶
To transcribe an audio file, the `audio-transcribe` endpoint can be used.
This endpoint creates a new conversation in the given bot and then transcribes the audio file. The transcript arrives as individual messages in the conversation, as well as a `$text_transcript` event which contains additional metadata about the transcription.
See the API documentation for more information.
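As a rough orientation, a transcription request carries the parameters described in the sections below. The following is only a sketch built from the field names documented on this page; the exact request format is defined in the API documentation:

locale: detect          # a language code, "detect", or "auto"; see "Locale detection" below
extra:                  # optional provider-specific parameters; see the "extra" section below
  speechmatics:
    transcript_config:
      domain: finance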
Locale detection¶
In order to create a correct transcription, the language of the audio file needs to be known. Some transcription providers can detect the language automatically, while others require it to be specified explicitly.
The `locale` parameter of the transcription request can be set either to a language code (e.g. `en`, `fr`, `de`) to force the transcription to be in that language, or to `detect`, or to `auto`.
- `detect`: The DialoX platform will attempt to detect the language of the audio file based on the first 30 seconds of the file. Under the hood, this uses an LLM (Google / Gemini-1.5-flash) to detect the language, but this can be overridden by creating a prompt in your bot called `detect_audio_language`.
- `auto`: Some providers have this detection built-in. If `auto` is specified, the platform will leave the detection to the provider.
Provider support for the `auto` and `detect` values of the `locale` parameter is as follows:
| Speech Provider | Supports | Notes |
|---|---|---|
| Google | `auto`, `detect` | `detect` is supported on all models; `auto` is only supported on the Chirp models. |
| Google AI | `auto` | Locale parameter is ignored; Gemini models always detect the language themselves. |
| Microsoft OpenAI | `auto` | Locale parameter is ignored; the Whisper model always detects the language itself. |
| Speechmatics | `auto`, `detect` | Both are supported, but `auto` may fail for some inputs. `detect` is the default. |
| Deepgram | `auto`, `detect` | Both are supported; `auto` is the default. |
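For example, to force a French transcription you would pass an explicit language code; the alternatives are shown as comments (an illustrative snippet of the `locale` field only):

locale: fr        # force French
# locale: detect  # let the DialoX platform detect the language
# locale: auto    # leave detection to the provider, where supported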
Provider-specific transcription parameters (`extra`)¶
The `extra` parameter is a metadata object that is passed along with the transcription API request. It can contain any information; however, when it contains a key with the name of a provider, the contents of that key are treated as provider-specific parameters and are passed on to that provider's transcription API request.
Speechmatics¶
For Speechmatics, the entire `extra.speechmatics` object is merged with the transcription request in the create transcription job API call. For example, to override the `transcript_config.domain` parameter, which can be used to specify the domain of the audio file:
extra:
  speechmatics:
    transcript_config:
      domain: finance
Google AI¶
For the Google AI (`google_ai`) provider, the `extra` parameter can contain a `google_ai` object with the following fields:
- `prompt`: The name of the prompt to use for the transcription.
- `bindings`: A dictionary of bindings to use for the prompt.
For example:
extra:
  google_ai:
    prompt:
      name: custom_audio_transcribe
    bindings:
      my_variable: "Value of my variable"
Deepgram¶
For the Deepgram (`deepgram`) provider, the entire `extra.deepgram` object is merged with the transcription request in the create transcription job API call.
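For example, a sketch that passes Deepgram's `smart_format` option through to the provider (this particular option is only an illustration; any parameter accepted by Deepgram's transcription API can be supplied this way):

extra:
  deepgram:
    smart_format: true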
Microsoft OpenAI (`whisper`)¶
For the Microsoft OpenAI (`whisper`) provider, the `extra` parameter can contain a `whisper_prompt` field, which is a string that is passed to the transcription API request. Read the Whisper prompting guide for more information.
For example:
extra:
  whisper_prompt: "Hi there and welcome to the show."
Overriding transcription API settings in a bot¶
The transcription API settings can be overridden in a bot by creating a `transcribe_config` file in the bot's root directory.
This file should be a YAML file with the following fields:
override:
  provider: speechmatics
  locale: detect
The parameters that can be overridden are the following:
- `provider`: The transcription provider to use. See the API docs for the available providers.
- `model`: The provider-specific transcription model to use.
- `locale`: The language to transcribe the audio in, or `detect`, or `auto`.
- `extra`: Extra parameters to pass to the transcription API request. See above.
- `no_speech_prob_cutoff`: The cutoff to consider when converting transcriptions to messages. The cutoff is a float between 0 and 1, where 0 is the most strict (only words with 100% confidence are considered) and 1 is the most lenient (all words are considered). It defaults to 0.5.
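As an illustration, a fuller override combining several of these parameters might look like the following sketch (the provider, model, and locale values are taken from the examples elsewhere on this page; the cutoff value is only an example):

override:
  provider: google
  model: chirp_2                 # provider-specific model
  locale: auto                   # Chirp models support provider-side language detection
  no_speech_prob_cutoff: 0.3     # stricter than the default of 0.5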
See the API documentation for more information.
Transcription variants¶
By specifying a `variants` object in the `transcribe_config` file, you can define multiple transcription variants for a single audio file. The `variants` object maps the name of each variant to an object with the same structure as the `override` object.
This is useful if you want to transcribe the same audio file in parallel using different providers or models, to evaluate the quality of the transcription.
variants:
  speechmatics:
    provider: speechmatics
    locale: detect
  google_chirp:
    provider: google
    model: chirp_2
    locale: auto