The NLP pipeline
NLP stands for Natural Language Processing: the automated analysis of user messages.
Whenever a message created by the user enters the system, whether it is a text message or a transcribed speech utterance, it goes through a sequence of processing steps called the NLP pipeline.
Each step in the pipeline has its own responsibility, for instance language detection, message tokenization or intent classification.
Since DialoX release 2.52 this pipeline is configurable per bot, so that it can be tuned to the bot's specific use case. This is done by creating an nlp YAML file in the root of the bot, or in a subdirectory when building a skill.
The default pipeline YAML looks like the following. Each individual pipeline step is documented below.
# This is the default NLP pipeline
pipeline_steps:
  - step: pattern_ignore
    options:
      pattern: "^[.]"
  - step: markup_stripper
  - step: auto_translator
  - step: spacy_tokenizer
  - step: duckling_entity_extractor
context_pipeline_steps:
  - step: bml_intent_matcher
  - step: gpt_local_intent_classifier
  - id: bot_dialogflow
    step: dialogflow_intent_classifier
    options:
      agent_from: bot
  - step: llm_intent_classifier
  - step: qna_intent_classifier
  - step: llm_topic_intent_classifier
  - id: defaults_dialogflow
    step: dialogflow_intent_classifier
    options:
      agent_from: defaults
Automatic translation¶
Requires: [], provides: [auto_translator]
Translates incoming messages into a target language when the message's conversation is configured to do so.
pipeline_steps:
  - step: auto_translator
An operator can enable auto-translation for a conversation in the studio via the inbox; alternatively, this can be done by using the REST API.
Google-based Language detector¶
Requires: [], provides: [language_detector]
Detects the message language using Google's language detection API.
pipeline_steps:
  - step: language_detector
Enriches the message with:
- stats.detected_language: ISO code of the detected language.
Dialogflow Intent classifier¶
Requires: [], provides: [intent_classifier, dialogflow_classifier]
Classifies the message against a Dialogflow agent: either the platform default one or the one from the bot.
pipeline_steps:
  - step: dialogflow_intent_classifier
    options:
      agent_from: bot
The step takes the agent_from option, which decides which Dialogflow agent will be used. It takes one of two values:
- When agent_from is bot, the Dialogflow agent will be the one that is configured through the Dialogflow integration setting in the 'integrations' tab of the studio, or alternatively, through a dialogflow.json file inside the bot.
- When agent_from is defaults, the "Botsquad defaults" platform-wide Dialogflow agent will be used for intent resolution.
The allow_fallback option is a boolean which decides whether a fallback intent that is detected in Dialogflow is used as such, or whether fallback intents are ignored (which is the default).
Enriches the message with:
- intent: a %Bubble.Intent{} struct which is filled when a Dialogflow intent is detected.
Markup stripper¶
Requires: [], provides: [markup_stripper]
Strips all regular Markdown and Speech Markdown tags from the input sentence.
pipeline_steps:
  - step: markup_stripper
Duckling entity extractor¶
Requires: [tokenizer], provides: [duckling_entity_extractor]
Extracts entities from the text using the Duckling library. A tokenizer needs to be present in the pipeline in front of the Duckling step.
pipeline_steps:
  - step: naive_tokenizer
  - step: duckling_entity_extractor
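The ordering rule can be illustrated with a small Python sketch. It is not the platform's validation code: the function `check_pipeline_order` and the `STEP_METADATA` table are hypothetical, and only the requires/provides values are taken from the step descriptions in this document.

```python
# Requires/provides metadata copied from the step descriptions in this document.
# The table and the checker itself are illustrative, not part of the platform.
STEP_METADATA = {
    "naive_tokenizer": {"requires": [], "provides": ["tokenizer", "naive_tokenizer"]},
    "spacy_tokenizer": {
        "requires": [],
        "provides": ["tokenizer", "spacy_tokenizer", "language_detector"],
    },
    "duckling_entity_extractor": {
        "requires": ["tokenizer"],
        "provides": ["duckling_entity_extractor"],
    },
    "bml_intent_matcher": {
        "requires": ["tokenizer"],
        "provides": ["intent_classifier", "bml_classifier"],
    },
}


def check_pipeline_order(step_names):
    """Return (step, missing_capability) pairs for requirements not met by earlier steps."""
    provided = set()
    problems = []
    for name in step_names:
        meta = STEP_METADATA[name]
        for capability in meta["requires"]:
            if capability not in provided:
                problems.append((name, capability))
        # Capabilities only become available to the steps that follow.
        provided.update(meta["provides"])
    return problems


# Tokenizer first: no problems reported.
print(check_pipeline_order(["naive_tokenizer", "duckling_entity_extractor"]))
# Tokenizer last: the Duckling step is missing its "tokenizer" requirement.
print(check_pipeline_order(["duckling_entity_extractor", "naive_tokenizer"]))
```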
The message's sents are enriched with additional tokenizations that correspond to the found Duckling entities. These can then be used in BML match expressions in Bubblescript.
The entities that are extracted are listed in the Duckling supported dimensions table. The names of the entities are lowercased and under_scored (e.g. PhoneNumber becomes phone_number).
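As an illustration of this naming rule, the following Python sketch converts a CamelCase dimension name into its lowercased, underscored form. The helper `dimension_to_entity_name` is hypothetical; only the PhoneNumber example comes from this document, and the other names are assumed to follow the same rule.

```python
import re


def dimension_to_entity_name(dimension: str) -> str:
    """Convert a CamelCase Duckling dimension name to lower snake_case."""
    # Insert an underscore before every uppercase letter that is not the first
    # character, then lowercase the whole name.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", dimension).lower()


for name in ["PhoneNumber", "AmountOfMoney", "Url"]:
    print(name, "->", dimension_to_entity_name(name))
```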
Microsoft CLU Intent classifier¶
Requires: [], provides: [intent_classifier, clu_classifier]
Classifies against the given Microsoft CLU agent.
pipeline_steps:
  - step: clu_intent_classifier
    options:
      provider_agent: my-agent
      deployment: my-deploy-1
The step takes the provider_agent option, which names the provider agent under which the intents are stored in the bot. It performs its query against the given deployment (which, when empty, defaults to default).
For credentials, it looks for an msclu-type integration under the alias msclu. The CLU project name is equal to the given provider agent.
When the provider agent contains a /, the first part is the integration alias while the second part is the project name.
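The resolution rules above can be summarized in a short Python sketch. The function `resolve_clu_target` and the keys of the returned dictionary are hypothetical names chosen for illustration; the defaults (alias msclu, deployment default) are the ones stated in the text.

```python
def resolve_clu_target(provider_agent: str, deployment: str = "") -> dict:
    """Split a provider_agent value into integration alias, project and deployment."""
    if "/" in provider_agent:
        # "alias/project": the first part is the integration alias.
        alias, project = provider_agent.split("/", 1)
    else:
        # No slash: use the default "msclu" alias; the project equals the agent name.
        alias, project = "msclu", provider_agent
    return {
        "integration_alias": alias,
        "project": project,
        # An empty deployment falls back to "default".
        "deployment": deployment or "default",
    }


print(resolve_clu_target("my-agent", "my-deploy-1"))
print(resolve_clu_target("support/my-agent"))
```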
Enriches the message with:
- intent: a %Bubble.Intent{} struct which is filled when a CLU intent is detected.
QnA Intent classifier¶
Requires: [], provides: [intent_classifier, qna_intent_classifier]
Classifies against the QnA intents in the bot.
pipeline_steps:
  - step: qna_intent_classifier
Enriches the message with:
- intent: a %Bubble.Intent{} struct which is filled when a QnA intent is classified.
LLM-based intent classifier for global intents¶
Requires: [], provides: [:llm_intent_classifier, :intent_classifier]
Uses ChatGPT to classify LLM intents.
pipeline_steps:
  - step: llm_intent_classifier
Enriches the message with:
- intent: a %Bubble.Intent{} struct which is filled when an intent is classified using the LLM.
GPT Prompt
The prompt option specifies that a custom GPT prompt (from a prompt YAML file) gets used for rendering the classifier prompt. By default, it uses a built-in prompt that looks like the following:
prompts:
  - id: classify_llm_intents
    label: Classify LLM intents
    text: |
      system: Classify the last user message to one of the following KEYWORDS: {{ intent_ids }}. When unsure, reply with: unknown
      {% for i in intents %}
      {% if i.description %}When {{ i.description }} reply: {{ i.id }}{% endif %}
      {% endfor %}
      ---
      assistant: {{ question }}
      user: {{ text }}
While rendering the prompt, the following variables are available:
- intent_ids: The IDs of all LLM intents as a single comma-separated string
- question: The last question asked by the bot
- text: The user utterance
- intents: All intents as a list. Each intent contains its own utterances as well.
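To make the role of these variables concrete, here is a Python sketch that approximates the rendered built-in prompt with plain string formatting. It is not the platform's template engine: the function `render_classifier_prompt`, the example intents, and the ", " separator used for intent_ids are assumptions, and whitespace in the real rendering may differ.

```python
def render_classifier_prompt(intents, question, text):
    """Approximate the built-in classifier prompt using plain string formatting."""
    # Assumption: the IDs are joined with ", "; the exact separator is not documented.
    intent_ids = ", ".join(intent["id"] for intent in intents)
    lines = [
        "system: Classify the last user message to one of the following "
        f"KEYWORDS: {intent_ids}. When unsure, reply with: unknown"
    ]
    for intent in intents:
        # Mirrors the `{% if i.description %}` guard in the template.
        if intent.get("description"):
            lines.append(f"When {intent['description']} reply: {intent['id']}")
    lines += ["---", f"assistant: {question}", f"user: {text}"]
    return "\n".join(lines)


# Hypothetical intents, for illustration only.
example_intents = [
    {"id": "order_status", "description": "the user asks where their order is"},
    {"id": "cancel_order"},  # no description, so no "When ..." line is emitted
]
print(render_classifier_prompt(example_intents, "How can I help?", "where is my parcel"))
```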
LLM Intent classifier for local intents¶
Requires: [:in_context], provides: [:local_intent_classifier]
Uses ChatGPT to classify the given local intents in the ask context.
context_pipeline_steps:
  - step: gpt_local_intent_classifier
    options:
      prompt: my_prompt_name
Enriches the message with:
- intent: a %Bubble.Intent{} struct which is filled when an intent is classified using GPT.
LLM Prompt
The prompt option specifies that a custom LLM prompt (from a prompt YAML file) gets used for rendering the classifier prompt. By default, it uses a built-in prompt that looks like the following:
prompts:
  - id: classify_local_intents
    label: Classify local intents
    text: |
      system: Classify the last user message to one of the following KEYWORDS: {{ intent_ids }}. When unsure, reply with: unknown
      {% for i in intents %}
      {% if i.description %}When {{ i.description }} reply: {{ i.id }}{% endif %}
      {% endfor %}
      EXAMPLES:
      {% for i in intents %}{% for u in i.utterances %}
      'user: {{ u.text }}' 'assistant: {{ i.id }}'
      {% endfor %}{% endfor %}
      ---
      assistant: {{ question }}
      user: {{ text }}
While rendering the prompt, the following variables are available:
- intent_ids: The IDs of the local intents as a single comma-separated string
- examples: The intent utterances formatted as a conversation between user and agent. Use the [[ examples ]] syntax to insert these in the prompt.
- question: The last question asked by the bot
- text: The user utterance
- intents: All intents as a list. Each intent contains its own utterances as well.
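The EXAMPLES part of the built-in prompt loops over every utterance of every intent. The Python sketch below reproduces that nested loop; the function `format_examples` and the sample intents are hypothetical, and it is an assumption that the examples variable is formatted the same way as the template's loop.

```python
def format_examples(intents):
    """Format intent utterances as quoted user/assistant pairs, one per line."""
    lines = []
    for intent in intents:
        for utterance in intent.get("utterances", []):
            # Same shape as the line inside the template's nested for-loops.
            lines.append(f"'user: {utterance['text']}' 'assistant: {intent['id']}'")
    return "\n".join(lines)


# Hypothetical local intents, for illustration only.
local_intents = [
    {"id": "yes", "utterances": [{"text": "sure"}, {"text": "ok"}]},
    {"id": "no", "utterances": [{"text": "nope"}]},
]
print(format_examples(local_intents))
```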
Spacy-based sentence tokenizer¶
Requires: [], provides: [tokenizer, spacy_tokenizer, language_detector]
A tokenizer pipeline step which uses Spacy to tokenize the input, perform language detection, detect default entities and perform POS tagging.
pipeline_steps:
  - step: spacy_tokenizer
Enriches the message with:
- sents: the message tokenizations, where each token is annotated with its POS tag or entity. Mostly used internally by BML.
- stats.detected_language: ISO code of the language that was detected from the message text.
- stats.word_count: The number of word tokens in the text.
- stats.token_count: The total number of tokens in the text.
LLM-based intent classifier for topic-based intents¶
Requires: [], provides: [:llm_topic_intent_classifier, :intent_classifier]
Uses ChatGPT to classify LLM topic intents.
pipeline_steps:
  - step: llm_topic_intent_classifier
Enriches the message with:
- intent: a %Bubble.Intent{} struct which is filled when an intent is classified using the LLM.
GPT Prompt
The prompt option specifies that a custom GPT prompt (from a prompt YAML file) gets used for rendering the classifier prompt. By default, it uses a built-in prompt that looks like the following:
prompts:
  - id: classify_llm_topic_intents
    label: Classify LLM topic intents
    text: |
      system: Classify the last user message to one of the following KEYWORDS: {{ intent_ids }}. When unsure, reply with: unknown
      {% for i in intents %}
      {% if i.description %}When {{ i.description }} reply: {{ i.id }}{% endif %}
      {% endfor %}
      ---
      assistant: {{ question }}
      user: {{ text }}
While rendering the prompt, the following variables are available:
- intent_ids: The IDs of all LLM intents as a single comma-separated string
- question: The last question asked by the bot
- text: The user utterance
- intents: All intents as a list. Each intent contains its own utterances as well.
BML intent matcher¶
Requires: [tokenizer], provides: [intent_classifier, bml_classifier]
Processes the bot's intents via 'hard matches'.
Intents are classified based on the tokenized sentence, using the BML expressions from the match: attribute of the intent and a literal match on the intent label. This short-circuits intent classification, so that the more expensive (in terms of processing time) QnA and Dialogflow classifiers can be skipped.
The input message needs to be tokenized by a previous pipeline step, so a tokenizer must precede this step in the pipeline.
pipeline_steps:
  - step: naive_tokenizer
  - step: bml_intent_matcher
Enriches the message with:
- intent: a %Bubble.Intent{} struct which is filled when a hard BML or label match is found.
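The short-circuiting idea can be sketched in Python as a chain of classifiers in which the first one to produce an intent wins. This is an illustration under assumptions, not the platform's implementation: `classify`, `label_matcher` and `slow_remote_classifier` are hypothetical stand-ins, and real BML matching is replaced here by a simple literal label lookup.

```python
def classify(message, classifiers):
    """Run classifiers in order and stop at the first one that returns an intent."""
    for name, classifier in classifiers:
        intent = classifier(message)
        if intent is not None:
            return name, intent
    return None, None


calls = []  # records which classifiers actually ran


def label_matcher(message):
    """Cheap stand-in for a hard match: a literal lookup on known labels."""
    calls.append("label_matcher")
    return "greeting" if message.lower() in {"hi", "hello"} else None


def slow_remote_classifier(message):
    """Stand-in for an expensive classifier such as a remote API call."""
    calls.append("slow_remote_classifier")
    return "smalltalk"


chain = [("label_matcher", label_matcher), ("slow_remote_classifier", slow_remote_classifier)]
print(classify("hello", chain))  # the cheap matcher answers, the slow one is skipped
print(calls)
```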
Naive sentence tokenizer¶
Requires: [], provides: [tokenizer, naive_tokenizer]
A naive tokenizer which uses a whitespace-based strategy to split the input string into sentences and words. Can be used in place of the Spacy tokenizer for faster processing.
The tokenizer does not do any POS tagging or entity recognition, so BML expressions that use those will never match when this tokenizer is used.
pipeline_steps:
  - step: naive_tokenizer
Enriches the message with:
- sents: the message tokenizations. Mostly used internally by BML.
- stats.word_count: The number of word tokens in the text.
- stats.token_count: The total number of tokens in the text.
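The difference between word_count and token_count can be illustrated with a rough Python approximation of a naive tokenizer. The actual splitting rules of the platform are not documented here, so this sketch makes its own assumptions: sentences end at ., ! or ? followed by whitespace, and punctuation marks count as tokens but not as words. The function `naive_tokenize` is hypothetical.

```python
import re


def naive_tokenize(text):
    """Split text into sentences and tokens, and count words and tokens."""
    # Assumption: a sentence ends at ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Each token is either a run of word characters or one punctuation mark.
    sents = [re.findall(r"\w+|[^\w\s]", s) for s in sentences]
    tokens = [t for sent in sents for t in sent]
    words = [t for t in tokens if re.fullmatch(r"\w+", t)]
    return {
        "sents": sents,
        "stats": {"word_count": len(words), "token_count": len(tokens)},
    }


result = naive_tokenize("Hello there. How are you?")
print(result["sents"])
print(result["stats"])
```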
Bubblescript task pipeline step¶
Requires: (configurable), provides: (configurable)
An experimental pipeline step which allows you to execute a Bubblescript task.
pipeline_steps:
  - step: bubblescript
    options:
      perform: nlp_redact
      requires: [tokenizer]
      provides: language_detector
With accompanying Bubblescript task:
task nlp_redact do
  message.text = replace(message.text, "shit", "****")
end
The task name must be given to the perform option. The first task that matches the guard will be executed; the other tasks with the same name will be ignored.
The requires option specifies which requirement(s) need to be fulfilled in earlier pipeline steps, for instance, a tokenizer might be required. The provides option can state one or more symbolic names that are being provided by this bubblescript step, for instance, a language detector.
Task execution
The following global variables are set in the interpreter: message and context. The execution context does NOT contain the usual user, bot or other globals.
The message and context that are given to the Bubblescript task are mutable: as the above example shows, their contents can be changed for use in subsequent pipeline steps.
In the task, set the variable nlp_pipeline_halt to true to prevent the execution of the rest of the pipeline.
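The combination of mutable message and context with an early halt can be modelled in Python as follows. This is only a conceptual sketch: the runner `run_pipeline` and the step functions are hypothetical, and storing nlp_pipeline_halt inside the context dictionary is an assumption made for the illustration, since in Bubblescript it is a variable set in the task.

```python
def run_pipeline(steps, message, context):
    """Run steps in order; stop early when a step sets the halt flag."""
    executed = []
    for name, step in steps:
        step(message, context)  # steps mutate message and context in place
        executed.append(name)
        if context.get("nlp_pipeline_halt"):
            break
    return executed


def redact(message, context):
    # Same replacement as the nlp_redact task shown above.
    message["text"] = message["text"].replace("shit", "****")


def halt_on_empty(message, context):
    # Hypothetical rule: nothing left to analyse, so skip the remaining steps.
    if not message["text"].strip():
        context["nlp_pipeline_halt"] = True


def classify(message, context):
    message["intent"] = "unknown"


steps = [("redact", redact), ("halt_on_empty", halt_on_empty), ("classify", classify)]

message = {"text": "oh shit"}
print(run_pipeline(steps, message, {}), message)

blank = {"text": "   "}
print(run_pipeline(steps, blank, {}), blank)
```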
Pattern-based ignore¶
Requires: [], provides: []
Skips the rest of the pipeline when the message matches a given regular expression.
pipeline_steps:
  - step: pattern_ignore
    options:
      pattern: "^[.]"
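The effect of such a pattern can be tried out with Python's re module. This is a sketch only: `should_ignore` is a hypothetical helper, and the platform's regular expression engine may differ from Python's in details, although simple patterns like these behave the same. It uses the pattern from the default pipeline, which matches messages that start with a literal dot, and contrasts it with an unescaped dot, which matches any character.

```python
import re


def should_ignore(text: str, pattern: str) -> bool:
    """Return True when the message text matches the ignore pattern."""
    return re.search(pattern, text) is not None


# "^[.]" only matches messages that start with a literal dot.
print(should_ignore(".debug on", "^[.]"))
print(should_ignore("hello.", "^[.]"))

# An unescaped dot matches any character, so "^." matches every
# message that starts with a character other than a newline.
print(should_ignore("hello", "^."))
```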