The NLP pipeline

NLP stands for Natural Language Processing: The automated analysis of user messages.

Whenever any message created by the user enters the system, being a text message or a transcribed speech utterance, it goes through a sequence of processing steps called the NLP pipeline.

Each step in the pipeline has their own responsibility: for instance language detection, message tokenization or intent classification.

Since DialoX release 2.52 this pipeline is configurable per bot so that it can be tuned to the specific use case of the bot. This is done by creating a nlp YAML file in the root of the bot, or in a subdirectory when building a skill.

Configuring Pipeline Steps¶

There are two main ways to define the steps in your nlp.yml file:

Explicitly Defining Steps: You can list all steps for the main pipeline using the pipeline_steps key and for the context-specific pipeline using the context_pipeline_steps key. If these keys are present, they define the entire pipeline for that section. An empty list (e.g., pipeline_steps: []) means no steps for that section.

# Example: Explicitly defining main pipeline steps
pipeline_steps:
  - step: markup_stripper
  - step: spacy_tokenizer
  - step: duckling_entity_extractor

# Example: Explicitly defining context pipeline steps
context_pipeline_steps:
  - step: bml_intent_matcher

Omitting Steps from Defaults: Instead of explicitly listing all steps, you can omit specific steps from the default pipeline. You can use the omit_steps key for the main pipeline and omit_context_steps for the context pipeline. These keys are only considered if their corresponding pipeline_steps or context_pipeline_steps keys are not present in your YAML.

Items in the omission list must be valid step names (e.g., "spacy_tokenizer") or step IDs (if a step in the default pipeline has a custom ID) that exist in the default pipeline for that section. If an invalid item is provided, an error will occur when the pipeline is built.
Providing an empty list (e.g., omit_steps: []) when pipeline_steps is not defined means you intend to use the full default main pipeline.

# Example: Using the default main pipeline but omitting 'spacy_tokenizer'
omit_steps:
  - spacy_tokenizer

# Example: Using the default context pipeline but omitting 'qna_intent_classifier'
omit_context_steps:
  - qna_intent_classifier

You cannot specify both pipeline_steps and omit_steps for the main pipeline in the same configuration. Similarly, context_pipeline_steps and omit_context_steps are mutually exclusive for the context pipeline. Doing so will result in an error.

If neither an explicit list (pipeline_steps or context_pipeline_steps) nor an omission list (omit_steps or omit_context_steps) is provided for a section, the system will automatically use the full default pipeline for that section.

The default pipeline YAML looks like the following. Each individual pipeline step is documented below.

# This is the default NLP pipeline
pipeline_steps:
- step: pattern_ignore
  options:
    pattern: "^[.]"
- step: markup_stripper
- step: auto_translator
- step: spacy_tokenizer
- step: duckling_entity_extractor

context_pipeline_steps:
- step: bml_intent_matcher
- step: gpt_local_intent_classifier
- id: bot_dialogflow
  step: dialogflow_intent_classifier
  options:
    agent_from: bot
- step: llm_intent_classifier
- step: qna_intent_classifier
- step: llm_topic_intent_classifier
- step: base_intent_classifier
- step: dialogflow_intent_classifier
  id: defaults_dialogflow
  options:
    agent_from: defaults

Automatic translation¶

Requires: [], provides: [auto_translator]

Translates incoming messages into a target language when the message's conversation is configured to do so.

pipeline_steps:
- step: auto_translator

An operator can enable auto-translation for a conversation in the studio via the inbox; alternatively, this can be done by using the REST API.

Google-based Language detector¶

Requires: [], provides: [language_detector]

Detects message language using Google's language detection API

pipeline_steps:
- step: language_detector

Enriches the message with:

stats.detected_language — ISO code of the detected language.

Dialogflow Intent classifier¶

Requires: [], provides: [intent_classifier, dialogflow_classifier]

Classifies against the Dialogflow agent; either the default one or the one from the bot

pipeline_steps:
- step: dialogflow_intent_classifier
  options:
    agent_from: bot

The step takes one option: agent_from, which decides which Dialogflow agent will be used. It takes one of two values:

When agent_from is bot, the dialogflow agent will be the one that is configured through the Dialogflow integration setting in the 'integrations' tab of the studio, or alternatively, through a dialogflow.json file inside the bot.
When agent_from is defaults, the "Botsquad defaults" platform-wide dialogflow agent will be used for intent resolution.

The option allow_fallback option is a boolean which decides whether any fallback intent that is detected in dialogflow will be used as such or that fallback intents are ignored (which is the default).

Enriches the message with:

intent — %Bubble.Intent{} struct which is filled when a Dialogflow intent is detected.

Markup stripper¶

Requires: [], provides: [markup_stripper]

Strips all regular Markdown and Speech markdown tags from the input sentence.

pipeline_steps:
- markup_stripper

Duckling entity extractor¶

Requires: [tokenizer], provides: [duckling_entity_extractor]

Extracts entities from the text using the Duckling library. A tokenizer needs to be present in the pipeline in front of the Duckling step.

pipeline_steps:
- step: naive_tokenizer
- step: duckling_entity_extractor

The message's sents are enriched with additional tokenizations that correspond to the found Duckling entities. These can then be used in BML match expressions in Bubblescript.

The entities that are extracted are listed in the Duckling supported dimensions table. The names of the entities are lowercased and under_scored (eg PhoneNumber becomes phone_number).

Microsoft CLU Intent classifier¶

Requires: [], provides: [intent_classifier, clu_classifier]

Classifies against the given Microsoft CLU agent.

pipeline_steps:
- step: clu_intent_classifier
  options:
    provider_agent: my-agent
    deployment: my-deploy-1

The step takes the provider_agent as option under which the intents are stored in the bot. It performs its query against the given deployment (which, when empty, defaults to default).

For credentials, it looks for msclu-type integration under the alias msclu. The CLU project name is equal to the given provider agent.

When the provider agent contains a /, the first part is the integration alias while the second part is the project name.

Enriches the message with:

intent — %Bubble.Intent{} struct which is filled when a Dialogflow intent is detected.

Platform-default base intent classifier based on LLM.¶

Requires: [], provides: [:intent_classifier]

pipeline_steps:
- step: base_intent_classifier

Enriches the message with:

intent — %Bubble.Intent{} struct which is filled when an intent is classified using the LLM.

Works similar to the regular LLM classifier, except that a system-default set of base intents is used.

QnA Intent classifier¶

Requires: [], provides: [intent_classifier, qna_intent_classifier]

Classifies against the QnA intents in the bot

pipeline_steps:
- step: qna_intent_classifier

Enriches the message with:

intent — %Bubble.Intent{} struct which is filled when a QnA intent is classified.

LLM-based intent classifier for global intents¶

Requires: [], provides: [:llm_intent_classifier, :intent_classifier]

Uses ChatGPT to classify LLM intents

pipeline_steps:
- step: llm_intent_classifier

Enriches the message with:

intent — %Bubble.Intent{} struct which is filled when an intent is classified using the LLM.

GPT Prompt

The prompt option specifies that a custom GPT prompt (from a prompt yaml file) gets used for rendering the classifier prompt. By default, it uses a builtin prompt, that looks like the following:

prompts:
  - id: classify_llm_intents
    label: Classify LLM intents
    text: |
      system: Classify the last user message to one of the following KEYWORDS: {{ intent_ids }}. When unsure, reply with: unknown
      {% for i in intents %}
      {% if i.description %}When {{ i.description }} reply: {{ i.id }}{% endif %}
      {% endfor %}
      ---
      assistant: {{ question }}
      user: {{ text }}

While rendering the prompt, the following variables are available:

intent_ids - The IDs of all LLM intents as a single comma separated string

question - The last question asked by the bot

text - The user utterance

intents - All intents as a list. Each intent contains its own utterances as well.

LLM Intent classifier for local intents¶

Requires: [:in_context], provides: [:local_intent_classifier]

Uses ChatGPT to classify the given local intents in the ask context.

context_pipeline_steps:
- step: gpt_local_intent_classifier
  options:
    prompt: my_prompt_name

Enriches the message with:

intent — %Bubble.Intent{} struct which is filled when an intent is classified using GPT.

LLM Prompt

The prompt option specifies that a custom LLM prompt (from a prompt yaml file) gets used for rendering the classifier prompt. By default, it uses a builtin prompt, that looks like the following:

prompts:
  - id: classify_local_intents
    label: Classify local intents
    text: |
      system: Classify the last user message to one of the following KEYWORDS: {{ intent_ids }}. When unsure, reply with: unknown
      {% for i in intents %}
      {% if i.description %}When {{ i.description }} reply: {{ i.id }}{% endif %}
      {% endfor %}
      EXAMPLES:
      {% for i in intents %}{% for u in i.utterances %}
      'user: {{ u.text }}' 'assistant: {{ i.id }}'
      {% endfor %}{% endfor %}
      ---
      assistant: {{ question }}
      user: {{ text }}

While rendering the prompt, the following variables are available:

intent_ids - The IDs of the local intents as a single comma separated string

examples - The intent utterances formatted as a conversation between user and agent. Use the [[ examples ]] syntax to insert these in the prompt.

question - The last question asked by the bot

text - The user utterance

intents - All intents as a list. Each intent contains its own utterances as well.

Spacy-based sentence tokenizer¶

Requires: [], provides: [tokenizer, spacy_tokenizer, language_detector]

A tokenizer pipeline step which uses Spacy to tokenize the input, perform language detection, detect default entities and performs POS tagging.

pipeline_steps:
- step: spacy_tokenizer

Enriches the message with:

sents — the message tokenizations, where each token is annotated as its POS tag or entity. Mostly used internally by BML.
stats.detected_language — ISO code of the language that was detected from the message text.
stats.word_count — The number of word tokens in the text.
stats.token_count — The total number of tokens in the text.

LLM-based intent classifier for topic-based intents¶

Requires: [], provides: [:llm_topic_intent_classifier, :intent_classifier]

Uses ChatGPT to classify LLM intents

pipeline_steps:
- step: llm_topic_intent_classifier

Enriches the message with:

intent — %Bubble.Intent{} struct which is filled when an intent is classified using the LLM.

GPT Prompt

The prompt option specifies that a custom GPT prompt (from a prompt yaml file) gets used for rendering the classifier prompt. By default, it uses a builtin prompt, that looks like the following:

prompts:
  - id: classify_llm_topic_intents
    label: Classify LLM topic intents
    text: |
      system: Classify the last user message to one of the following KEYWORDS: {{ intent_ids }}. When unsure, reply with: unknown
      {% for i in intents %}
      {% if i.description %}When {{ i.description }} reply: {{ i.id }}{% endif %}
      {% endfor %}
      ---
      assistant: {{ question }}
      user: {{ text }}

While rendering the prompt, the following variables are available:

intent_ids - The IDs of all LLM intents as a single comma separated string

question - The last question asked by the bot

text - The user utterance

intents - All intents as a list. Each intent contains its own utterances as well.

BML intent matcher¶

Requires: [tokenizer], provides: [intent_classifier, bml_classifier]

Processes the bot's intents via 'hard matches'.

Intents are classified based on the tokenized sentence using the BML expressions from the match: attribute of the intent and from a literal match on the intent label. This allows for short-circuiting of intent classification for skipping the more expensive (in terms of processing time) QnA and Dialogflow classifiers.

The input message needs to be tokenized by a previous pipeline step before this step can be added to the pipeline.

pipeline_steps:
- step: naive_tokenizer
- step: bml_intent_matcher

Enriches the message with:

intent — %Bubble.Intent{} struct which is filled when a hard BML or label match is found.

Naive sentence tokenizer¶

Requires: [], provides: [tokenizer, naive_tokenizer]

A naive tokenizer which uses a whitespace-based strategy to split the input string in sentences and words. Can be used in place of the spacy tokenizer for more faster processing.

The tokenizer does not do any POS taging or entity recognition, so BML expressions that use those will never match when this tokenizer is used.

pipeline_steps:
- step: naive_tokenizer

Enriches the message with:

sents — the message tokenizations. Mostly used internally by BML.
stats.word_count — The number of word tokens in the text.
stats.token_count — The total number of tokens in the text.

Bubblescript task pipeline step¶

Requires: (configurable), provides: (configurable)

An experimental pipeline step which allows you to execute a Bubblescript task.

pipeline_steps:
- step: bubblescript
  options:
    perform: nlp_redact
    requires: [tokenizer]
    provides: language_detector

With accompanying Bubblescript task:

task nlp_redact do
  message.text = replace(message.text, "shit", "****")
end

The task name must be given to the perform option. The first task that matches the guard will be executed, the other tasks with the same name will be ignored.

The requires option specifies which requirement(s) need to be fulfilled in earlier pipeline steps, for instance, a tokenizer might be required. The provides option can state one or more symbolic names that are being provided by this bubblescript step, for instance, a language detector.

Task execution

The following global variables are set in the interpreter: message and context. The execution context does NOT contain the usual user, bot or other globals.

The message and context that are given to the Bubblescript task are mutable: like the above example shows, their contents can be changed for use in subsequent pipeline steps.

In the task, set the variable nlp_pipeline_halt to true to prevent the execution of the rest of the pipeline.

Pattern-based ignore¶

Requires: [], provides: []

Skips the rest of the pipeline when the message matches a given regular expression.

pipeline_steps:
- step: pattern_ignore
  options:
    pattern: "^."