Using spaCy In Your Chatbot For Natural Language Processing
And It’s Easier Than You Think…
Introduction
spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.
Natural Language Processing and Understanding can be lightweight and easy to implement. It is within anyone’s grasp to write some Python code to process natural language input and expose it as an API.
spaCy is an NLP tool, not a chatbot development framework.
It does not cater for dialog scripts, Natural Language Generation (NLG), dialog state management, etc.
But what makes spaCy all the more interesting is that it can be implemented as a language processing API assisting an existing chatbot implementation. Especially in instances where users submit longer input, the chatbot will do well to work with only a specific span or a few tokens from the utterance.
Also, it can be used for offline post-processing of user conversations.
Positives
- Quick and easy to start prototyping with.
- Excellent documentation and tutorials.
- Custom models can be trained.
- A good resource to serve as an introduction to NLP.
- A good avenue to familiarize yourself with basic NLP concepts.
Considerations
- Large sets of data are required for training custom models.
- Complex implementations can become very technical.
- Less widely supported languages might pose a challenge.
There is a commonly held belief that when it comes to Natural Language Processing (NLP) you are at the mercy of the giants of cloud computing: IBM, AWS, Microsoft and Google, to name a few.
The good news is, you are not!
Another challenge when it comes to NLP is that organizations often do not want their data to cross country borders, or reside in a commercial cloud environment where it is hard to enforce laws pertaining to the protection of personal information.
What is Conversational AI?
What is this thing known to us as Natural Language Processing, Natural Language Understanding or Conversational AI?
Conversational AI can be seen as the process of automating communication and creating personalized customer experiences at scale.
- It is the process of receiving unstructured conversational data as input.
- Then, in turn, structuring this data to derive meaning, intent and entities from it.
This process can be real-time, while a conversation is in progress, or it can be a bulk, after-the-fact process where text-based communications are analyzed. These text-based communications include emails, saved customer conversations, live agent chats, etc.
Entities
Within spaCy, a model can learn which words are entities. You can teach a model to recognize new entities in similar contexts, even if those entities were not in the training data.
While spaCy comes with a range of pre-trained models to predict linguistic annotations, you almost always want to fine-tune them with more examples.
You can do this by training them with more labelled data.
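A minimal sketch of what such labelled training data can look like, using character offsets and an entity label. The example texts and the “GADGET” label are assumptions for illustration, not data from the article.

```python
# Each item pairs a text with the character offsets and label of its entities.
# spaCy (v3) converts data in this shape into Example objects for training.
TRAIN_DATA = [
    ("How do I restart my iPhone X?", {"entities": [(20, 28, "GADGET")]}),
    ("The Galaxy Note kept overheating", {"entities": [(4, 15, "GADGET")]}),
]
```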
Practical Implementations
Here are a few practical examples and implementations of spaCy, making use of a Google Colab notebook.
In this example an nlp object is created and a sequence of tokens is assigned to doc. A doc is a container for accessing linguistic annotations.
A loop iterates over the tokens in the doc. Two checks are performed: firstly, whether the token resembles a number; secondly, whether the token that follows it is the percent sign.
The flexibility of token.like_num can be seen in this example, and subsequently the symbol “%” is searched for. From this example it is clear how a quick first pass can be performed on text and basic language information extracted.
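A minimal sketch of this check, assuming the small English model and a sample sentence (the notebook’s original text is not shown here):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

for token in doc:
    # Does this token resemble a number ("60", "sixty", "4")?
    if token.like_num:
        # Guard against running past the end of the doc
        if token.i + 1 < len(doc):
            next_token = doc[token.i + 1]
            # Is the following token the percent sign?
            if next_token.text == "%":
                print("Percentage found:", token.text)
```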
The example below iterates over the tokens and prints out each token. The second column lists the part of speech: verb, pronoun, noun, number, etc.
The third is the dependency label, describing the relation between individual tokens, such as subject or object.
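A short sketch of printing these token attributes, with an assumed example sentence:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("She ate the pizza")

for token in doc:
    # text, coarse part-of-speech tag, dependency label and syntactic head
    print(f"{token.text:<10}{token.pos_:<8}{token.dep_:<8}{token.head.text}")
```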
The Matcher allows you to find words and phrases using rules. A pattern is a list of dictionaries that is added to the matcher.
In this more complex example the pattern looks for “iOS” followed by any number.
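A sketch of such a rule, with an assumed sample text: the pattern matches the literal token “iOS” followed by a digit token.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Pattern: the exact text "iOS", then a token consisting of digits
pattern = [{"TEXT": "iOS"}, {"IS_DIGIT": True}]
matcher.add("IOS_VERSION_PATTERN", [pattern])

doc = nlp("iOS 11 has arrived. Upgrading from iOS 10 was painless.")
for match_id, start, end in matcher(doc):
    print("Match found:", doc[start:end].text)
```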
Below is a more plausible production implementation where, using the lemma, any form of the word “download” is matched, but only if it is followed by a proper noun.
You will see that in the last sentence the word download is present, but it is not followed by a proper noun.
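A sketch of this lemma-based pattern; the example sentences are assumptions chosen so that the last one contains “download” without a proper noun after it.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Any inflection of "download" followed by a proper noun
pattern = [{"LEMMA": "download"}, {"POS": "PROPN"}]
matcher.add("DOWNLOAD_PATTERN", [pattern])

doc = nlp(
    "I downloaded Fortnite on my laptop. "
    "Should I download WinZip too? "
    "I always download as much as possible."
)
for match_id, start, end in matcher(doc):
    print("Match:", doc[start:end].text)
# The last sentence contains "download" but no proper noun after it,
# so it produces no match.
```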
Here a proper noun needs to be followed by a verb to be detected.
Only two sentences qualify for this match.
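A sketch under the same assumptions: a proper noun immediately followed by a verb, with three made-up sentences of which only two match.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# A proper noun immediately followed by a verb
pattern = [{"POS": "PROPN"}, {"POS": "VERB"}]
matcher.add("PROPN_VERB_PATTERN", [pattern])

doc = nlp(
    "Amazon launched a new delivery service. "
    "Google announced a rival offering. "
    "The weather was mild all week."
)
for match_id, start, end in matcher(doc):
    print("Match:", doc[start:end].text)
```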
I particularly like this feature, where a similarity score can be retrieved between two spans (two sequences of tokens).
This could be used to create topics and segment user input according to topic.
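A sketch of comparing two spans, assuming the medium English model (which ships with word vectors; the small model does not, so its similarity scores are not meaningful) and an assumed example sentence:

```python
import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("I like pizza and pasta. I prefer burgers and fries.")

span1 = doc[2:5]   # "pizza and pasta"
span2 = doc[8:11]  # "burgers and fries"

# Similarity is based on the average of the spans' word vectors
print(span1.text, "vs", span2.text, ":", span1.similarity(span2))
```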
Creating a function to check if text has a number…
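A minimal sketch of such a helper; the function name is a hypothetical choice for illustration.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def has_number(text: str) -> bool:
    """Return True if the text contains a number-like token."""
    doc = nlp(text)
    return any(token.like_num for token in doc)

print(has_number("The order contains 5 items"))   # True
print(has_number("The order contains no items"))  # False
```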
Here, text carrying the named entity label of Person, Organization, Geopolitical Entity or Location is appended to a Wikipedia URL, as sketched after the list below.
Examples of named entities are:
- ORGANIZATION: IBM, Apple
- PERSON: Edward Snowden, Max Verstappen
- GPE: South Africa, Egypt
- etc.
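A sketch of the Wikipedia lookup idea; the sample sentence and the URL scheme are assumptions for illustration, not spaCy functionality.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Max Verstappen drives for Red Bull Racing and was born in Belgium.")

def wikipedia_url(ent):
    # Only build URLs for the entity types mentioned above
    if ent.label_ in ("PERSON", "ORG", "GPE", "LOC"):
        return "https://en.wikipedia.org/wiki/" + ent.text.replace(" ", "_")
    return None

for ent in doc.ents:
    url = wikipedia_url(ent)
    if url:
        print(ent.text, ent.label_, url)
```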
In this example data is retrieved in JSON format from a URL and a doc object is created. When the criteria from a set of patterns are met, the entity label Gadget is assigned.
The phrase is returned, together with the entity label and the position of the entity.
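A hedged sketch of this flow: the URL and the returned phrase list are hypothetical placeholders, and the matching is done here with a PhraseMatcher on a blank English pipeline.

```python
import json
from urllib.request import urlopen

import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

# A blank English pipeline is enough here; only the tokenizer is needed
nlp = spacy.blank("en")

# Hypothetical endpoint returning a JSON list such as ["iPhone X", "Galaxy Note"]
with urlopen("https://example.com/gadgets.json") as response:
    phrases = json.loads(response.read())

matcher = PhraseMatcher(nlp.vocab)
matcher.add("GADGET", [nlp.make_doc(phrase) for phrase in phrases])

doc = nlp("I traded my Galaxy Note for an iPhone X last week.")

# Label every match as a GADGET entity and attach it to the doc
doc.ents = [Span(doc, start, end, label="GADGET") for _, start, end in matcher(doc)]

for ent in doc.ents:
    print(ent.text, ent.label_, ent.start, ent.end)
```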
Conclusion
From the few examples listed here it is clear that spaCy’s out-of-the-box processing capability is significant, especially for text classification and named entities. Complex rules can be created to classify text and extract information.
When implementing a more advanced solution, the need for training data will add some complexity, with hundreds to thousands of examples required. For more complex implementations, the Python code will also grow in complexity.