Photo by Carina Sze on Unsplash

Using Rasa For Chatbot Development In Any Vernacular Language

Regional Languages Should Not Be An Impediment For Conversational Interface Development

Cobus Greyling
5 min read · Aug 24, 2020

Introduction

When venturing into the field of chatbots and Conversational AI, the process starts with a search for available frameworks. This usually leads you to one of the big cloud chatbot service providers.

IBM Watson Assistant — Language Options

Most probably you will end up using IBM Watson Assistant, Microsoft LUIS/Bot Framework, Dialogflow or similar. There are advantages: these environments offer easy entry in terms of cost and a low-code or no-code approach.

Microsoft LUIS Language Options

However, one big impediment you often run into with these environments is the lack of diversity when it comes to languages.

Minority Languages

Many conversational interfaces need to exist and be used in geographic areas where there are numerous smaller languages. In Africa alone, there are seven major language families, and the total number of languages natively spoken in Africa is estimated to be between 1,250 and 2,100.

By some counts there are more than 3,000, depending on what you regard as a language and what you regard as a dialect.

Rasa Prototype From May 2019 ~ Multiple Intents & Language Independent in the Afrikaans Language.

Making provision for these languages is not feasible or viable for the large cloud providers, even though it is highly desirable in these markets.

These geographic areas are often in dire need of access to information at a very low cost. Low cost often means asynchronous communication like chatbots via text or SMS.

When considering creating a Conversational UI in a vernacular language, the assumption is often made that you require:

  • Massive Computing Power
  • Masses of Training Data and
  • Very Specialized Knowledge.

Rasa has solved all three of these impediments.

Rasa & Vernacular

You have to define your chatbot output in the specific language. This is the dialog your chatbot will return to the user to facilitate the conversation.

You also have to define the user input. This is done by creating intents, with 15 to 20 example user utterances each. Within these utterances you can define your entities.

## intent:travel_details
- I want to travel by [train](travel_mode) from [Berlin](from_city) to [Stuttgart](to_city) on [Friday](date_time)

An intent called travel_details with one user utterance example.
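To make the annotation syntax concrete, here is a minimal Python sketch that pulls the plain text and the entities out of the example utterance above. It is only an illustration of the `[value](entity_name)` convention; the function name `parse_utterance` and the output dictionary layout are assumptions for this example, not Rasa's own parser.

```python
import re

# Matches the Markdown entity annotation: [value](entity_name)
ENTITY_PATTERN = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def parse_utterance(annotated):
    """Return (plain_text, entities) for one annotated training example.

    Hypothetical helper for illustration only, not part of Rasa.
    """
    entities = []
    plain_parts = []
    cursor = 0  # position in the annotated string
    length = 0  # length of the plain text built so far
    for match in ENTITY_PATTERN.finditer(annotated):
        before = annotated[cursor:match.start()]
        plain_parts.append(before)
        length += len(before)
        value, name = match.group(1), match.group(2)
        entities.append(
            {"entity": name, "value": value,
             "start": length, "end": length + len(value)}
        )
        plain_parts.append(value)
        length += len(value)
        cursor = match.end()
    plain_parts.append(annotated[cursor:])
    return "".join(plain_parts), entities


example = (
    "I want to travel by [train](travel_mode) from [Berlin](from_city) "
    "to [Stuttgart](to_city) on [Friday](date_time)"
)
text, entities = parse_utterance(example)
print(text)
for entity in entities:
    print(entity["entity"], "=", entity["value"])
```

The same annotation works for utterances in any language, since the brackets only mark character spans and say nothing about vocabulary or grammar.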

Develop For Any Language

With Rasa’s flexible pipeline, you are not tied to a specific language and you can train your model to be more domain specific. Most chatbots address a very narrow domain in any case.

If there are no word embeddings for your language, or you have a very domain-specific chatbot, Rasa recommends the following pipeline:

language: "fr"  # your two-letter language code

pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100

The good news is that no in-depth knowledge of the pipeline is required, and it can merely be copied into your project.
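For the curious, the second CountVectorsFeaturizer entry (analyzer "char_wb", n-grams of 1 to 4 characters) is a large part of why this pipeline needs no pre-trained word embeddings: features are built from character fragments of your own training data. The sketch below is a simplified approximation of that idea in plain Python. The function name `char_wb_ngrams` is made up for this example, and the real featurizer differs in details such as how very short words are handled. The Afrikaans words "trein" (train) and "treine" (trains) are used purely as illustration.

```python
def char_wb_ngrams(text, min_n=1, max_n=4):
    """Character n-grams taken inside word boundaries.

    Simplified illustration: each word is lowercased, padded with one
    space on either side, and sliced into all n-grams of length
    min_n..max_n. Not the featurizer's actual implementation.
    """
    ngrams = []
    for word in text.lower().split():
        padded = f" {word} "
        for n in range(min_n, max_n + 1):
            if n > len(padded):
                break  # word too short for this n-gram size
            for start in range(len(padded) - n + 1):
                ngrams.append(padded[start:start + n])
    return ngrams


# Two related word forms share many character fragments, so a model can
# relate them without ever having seen a dictionary for the language.
shared = set(char_wb_ngrams("trein")) & set(char_wb_ngrams("treine"))
print(sorted(shared))
```

Because related word forms overlap heavily at the character level, the classifier can generalise from a few dozen examples per intent, whatever the language.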

Obviously there are components for:

  • Tokenization

Tokenization is the process of splitting an utterance (a phrase, sentence, paragraph or document) into smaller units, typically single words. Each of these units is referred to as a token.

  • Entity Extraction
  • Intent Classification
  • Response Selection
  • etc.

You can read more about these components here.
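As a rough picture of what whitespace tokenization means in practice, the following Python sketch splits an utterance on whitespace and records where each token sits in the original text. It is a simplified stand-in, not the WhitespaceTokenizer component itself, and the helper name `whitespace_tokenize` is an assumption for this example. The Afrikaans sentence means "I want to travel by train".

```python
def whitespace_tokenize(utterance):
    """Split an utterance on whitespace and keep character offsets.

    Illustrative helper only; a production tokenizer would also deal
    with punctuation and other edge cases.
    """
    tokens = []
    cursor = 0
    for word in utterance.split():
        # Search from the cursor so repeated words get the right offset.
        start = utterance.index(word, cursor)
        end = start + len(word)
        tokens.append({"text": word, "start": start, "end": end})
        cursor = end
    return tokens


tokens = whitespace_tokenize("Ek wil per trein reis")
for token in tokens:
    print(token["text"], token["start"], token["end"])
```

This approach assumes words are separated by spaces, which holds for Afrikaans and many other languages; a language written without spaces between words would need a different tokenizer.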

Getting Started

  1. Below is the installation guide for Windows 10, which is comprehensive and should see you through a complete and successful installation. There are also tutorials for other operating systems. Here you will also see how to initialize your first project.
  2. Inside your new project folder, navigate to config.yml and paste your pipeline into this file. It is always wise to make a backup of the default configuration, or of any file you change for that matter.
  3. Your language-specific data will primarily be defined in two areas: the input and the output of the chatbot. Rasa has to know what your user is saying by recognizing intents and identifying entities, so you will have to define your training examples in your vernacular or language of choice. You will capture it in the following files:
  • Input: Your user’s input will be interpreted via the intents and entities defined in this file:
/data/nlu.md
  • Output: The portion where the bot speaks back to the user also has to be in your language. It is defined as response strings, for the things your assistant can say, in this file:
domain.yml
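To illustrate how little language-specific material is involved, here is a hedged Python sketch that renders a few Afrikaans greetings into the Markdown intent format shown earlier, and two replies into a `responses:` section of the kind found in domain.yml. The helper names `render_nlu_md` and `render_responses`, the intent names and the sample phrases are all assumptions made for this example; a generated project's real files contain more than this, and you can just as well edit them by hand.

```python
def render_nlu_md(intents):
    """Render {intent_name: [utterances]} as Markdown NLU training data."""
    blocks = []
    for name, utterances in intents.items():
        lines = [f"## intent:{name}"] + [f"- {u}" for u in utterances]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


def render_responses(responses):
    """Render {response_name: text} as a `responses:` YAML section.

    Assumes the text contains no double quotes (no escaping is done).
    """
    lines = ["responses:"]
    for name, text in responses.items():
        lines.append(f"  {name}:")
        lines.append(f'  - text: "{text}"')
    return "\n".join(lines) + "\n"


# Only a few utterances are shown; aim for 15 to 20 per intent.
intents = {
    "greet": ["hallo", "goeie môre", "goeie dag"],
    "goodbye": ["totsiens", "baie dankie, totsiens"],
}
responses = {
    "utter_greet": "Hallo! Hoe kan ek help?",
    "utter_goodbye": "Totsiens!",
}

print(render_nlu_md(intents))
print(render_responses(responses))
```

The first output would go into the NLU data file and the second into the responses section of domain.yml, after backing up the originals as suggested in step 2.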

Here is a list of files created when you initialize a project.

Language requirements must be defined in the nlu.md file for input and in domain.yml for output.

This video has everything you need to install Rasa on Windows 10:

Install Rasa Open Source On Windows 10

Getting to grips with the basics is covered on this page:

https://rasa.com/docs/rasa/user-guide/rasa-tutorial

And finally, the ten-episode Rasa Masterclass will take you through all the required components for a resilient AI Assistant.

The Rasa Masterclass has everything you need to get you started.

Conclusion

The more astute Conversational AI environments are mastering the art of adding functionality, and by implication complexity, to their platforms while simultaneously simplifying the user interface.

This is evident with Rasa and only a few other platforms.

Cobus Greyling

I explore and write about all things at the intersection of AI & language; LLMs/NLP/NLU, Chat/Voicebots, CCAI. www.cobusgreyling.com