Write a Natural Language Processing API to Analyze Any Text
There is a commonly held belief that when it comes to Natural Language Processing (NLP) you are at the mercy of the giants of cloud computing, such as IBM, AWS, Microsoft and Google, to name a few.
The good news is, you are not!
Another challenge when it comes to NLP is that organisations often do not want their data to cross country borders, or to reside in some commercial cloud environment where it is hard to enforce laws pertaining to the protection of personal information, or even to know where the data resides.
What is Conversational AI?
What is this thing known to us as Natural Language Processing, Natural Language Understanding or Conversational AI?
Conversational AI can be seen as the process of automating communication and creating personalized customer experiences at scale. It is the process of receiving input in the form of unstructured conversational data, and then structuring this data to derive meaning, intent and entities from it.
This process can be real-time, while a conversation is in progress, or it could be a bulk, after-the-fact process where text-based communications are analyzed. These text-based communications include emails, saved customer conversations, live agent chats etc.
Our Tools Of Choice
spaCy is written in the programming languages Python and Cython. The library is published under the MIT license and currently offers statistical neural network models for English, German, Spanish, Portuguese, French, Italian, Dutch and multi-language NER, as well as tokenization for various other languages.
spaCy focuses on providing software for production usage.
Among others, there are two environments in which you can install and run spaCy. The first is Anaconda.
Anaconda is ideal for creating one or multiple virtual environments on your machine. You can activate and deactivate these virtual environments, and so maintain them in parallel.
Should you want to delete one of these environments, it is much easier than uninstalling a whole list of applications which might create system instability. Once you have installed Anaconda, you can create a Python environment with the following command:
conda create -n spacy python=3.5
I named my new environment “spacy” and installed Python. If you do not include Python, pip will not be available for subsequent installs in your environment.
To access your new environment:
conda activate spacy
From here you can install spaCy with the following commands:
pip install spacy
python -m spacy download en
For this exercise, however, I decided not to use Anaconda and switched over to a Jupyter Notebook.
Jupyter Notebook is a web-based interactive computational environment for creating Jupyter notebook documents. The “notebook” term can colloquially make reference to many different entities, mainly the Jupyter web application, Jupyter Python web server, or Jupyter document format depending on context.
A Jupyter Notebook document is a JSON document, following a versioned schema, and containing an ordered list of input/output cells which can contain code, text (using Markdown), mathematics, plots and rich media, usually ending with the “.ipynb” extension.
A Jupyter Notebook can be converted to a number of open standard output formats (HTML, presentation slides, LaTeX, PDF, ReStructuredText, Markdown, Python) through “Download As” in the web interface, via the nbconvert library or “jupyter nbconvert” command line interface in a shell.
To simplify visualisation of Jupyter notebook documents on the web, the nbconvert library is provided as a service through NbViewer which can take a URL to any publicly available notebook document, convert it to HTML on the fly and display it to the user.
So once you have created your notebook in a browser (I used Chrome), run the following commands:
!pip install spacy
!python -m spacy download en
As simple as that! Now we are ready to start with the basics of our NLP API…
Within NLP, what is tokenization? Tokenization is the process of splitting a string or text into a list of tokens. One can think of a token as a part of a larger whole: a word is a token in a sentence, and a sentence is a token in a paragraph.
This is a screen-print of a Jupyter Notebook; you can see the few lines of code and the sentence we want to tokenize. From the output, you can see how the following words are broken down: shouldn’t, aren’t and you’ve.
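A minimal sketch of the tokenization step, for readers who want to try it outside the notebook. It assumes a blank English pipeline created with `spacy.blank("en")`, which only carries the rule-based tokenizer and needs no model download; the helper name `tokenize` is hypothetical and not part of spaCy.

```python
import spacy

# Assumption: a blank English pipeline is enough for this step, because
# spaCy's tokenizer is rule-based and does not need a trained model.
nlp = spacy.blank("en")


def tokenize(text):
    """Return the token strings spaCy produces for the given text."""
    return [token.text for token in nlp(text)]


sentence = "When you try and learn something new, you shouldn't only read about it!"
tokens = tokenize(sentence)
print(tokens)
```

Contractions are split into their parts, so "shouldn't" comes back as two tokens rather than one.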
Below is the default list of spaCy stop words, with 326 entries, each entry being a single word. It should be clear why these words are not useful for data analysis. In most cases these words do not assist us in understanding the basic meaning of a sentence, or they are too vague to use in an NLP process.
Below is our sentence:
When you try and learn something new, you shouldn’t Only read about it!
Weren’t not the same, reading aren’t always the best, Try doing something you’ve not done before!
With the stop words removed, the few words remaining give us a general idea of what the sentence or conversation is about. In a chatbot environment this can be useful when trying to derive an intent from the user’s input.
In essence, a stop word list is a list of the most common words of a language that are often useful to filter out. This could form part of an initial high-level NLP pass where the user input is long and it is hard to derive meaning and intent.
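Stop word removal can be sketched in a few lines. This assumes the `STOP_WORDS` set from `spacy.lang.en.stop_words` and the `is_stop` and `is_punct` token attributes; the helper name `remove_stop_words` is hypothetical. The size of the stop word list may differ between spaCy versions, so the count is printed rather than assumed.

```python
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

# A blank pipeline is sufficient: is_stop and is_punct are lexical
# attributes and do not depend on a trained model.
nlp = spacy.blank("en")


def remove_stop_words(text):
    """Return the tokens that are neither stop words nor punctuation."""
    return [t.text for t in nlp(text) if not t.is_stop and not t.is_punct]


# Size of the default English stop word list in the installed version.
print(len(STOP_WORDS))

sentence = "When you try and learn something new, you shouldn't only read about it!"
print(remove_stop_words(sentence))
```

What remains is a short list of content words, which is the kind of reduced input that could be handed to an intent detection step.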
spaCy includes a built-in option with which a single word can be reduced to its lemma, hence lemmatization. Below is an example, using .lemma_ to produce the lemma for each word listed in the phrase.
Lemmatization addresses the fact that words like think, thinks, thinking, thinker, thought are not exactly the same, but, they all have the same basic meaning: to think!
The differences in spelling are a way of adjusting the word for our spoken language, but for machine processing those changes can create aberrations.
With lemmatization, we are looking at the word as it is described in a dictionary.
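As a rough sketch, the lemma lookup with `.lemma_` could be wrapped as follows. The helper name `lemmatize` is hypothetical. It also assumes that a trained pipeline such as `en_core_web_sm` is installed when you want real lemmas; newer spaCy releases expect that full package name in the download command instead of the `en` shortcut.

```python
from spacy.language import Language


def lemmatize(nlp: Language, text: str):
    """Return (word, lemma) pairs for every token in the text."""
    return [(token.text, token.lemma_) for token in nlp(text)]


# Usage sketch (assumes the en_core_web_sm package has been downloaded):
#
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   for word, lemma in lemmatize(nlp, "think thinks thinking thinker thought"):
#       print(word, "->", lemma)
```

Passing the pipeline in as an argument keeps the helper independent of which model is installed. With a blank pipeline the lemmas may simply come back empty, since no lemmatizer is present.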
Part Of Speech
Each word has a function within a sentence. A noun defines an entity… an object. The adjective has the function of describing the object. An action is defined by a verb.
The figure above shows POS tagging done with spaCy.
spaCy correctly identified the part of speech for each word in the sentence, down to the punctuation. This assists with understanding input sentences and their context.
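A small sketch of how the tags could be read from a processed sentence. It assumes the `pos_` (coarse-grained) and `tag_` (fine-grained) token attributes and a trained pipeline such as `en_core_web_sm` for meaningful values; the helper name `pos_tags` is hypothetical.

```python
from spacy.language import Language


def pos_tags(nlp: Language, text: str):
    """Return (word, coarse POS, fine-grained tag) for every token."""
    return [(token.text, token.pos_, token.tag_) for token in nlp(text)]


# Usage sketch (assumes the en_core_web_sm package has been downloaded):
#
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   for word, pos, tag in pos_tags(nlp, "Try doing something you've not done before!"):
#       print(word, pos, tag, spacy.explain(tag))
```

`spacy.explain` returns a short human-readable description of a tag, which helps when the abbreviations are unfamiliar.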
This is a very important section if you are interested in using spaCy for chatbot implementations. The functions mentioned prior to Entity Detection are very useful for a higher-order first-pass NLP layer, for example to do a basic structuring of the user input before sending it off to your chatbot’s NLU.
But what makes entities useful is that they can be used to extract nouns from the conversation, be those names, organisations, monetary values, dates etc.
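Entity extraction could be sketched like this, assuming the `doc.ents` property and the `label_` attribute of each entity span. The helper name `extract_entities` is hypothetical, and a trained pipeline such as `en_core_web_sm` is assumed for real-world text.

```python
from spacy.language import Language


def extract_entities(nlp: Language, text: str):
    """Return (entity text, entity label) pairs found in the text."""
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


# Usage sketch (assumes the en_core_web_sm package has been downloaded):
#
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   print(extract_entities(nlp, "Google paid $5 million to IBM in March."))
```

Which labels come back (for example organisations, money or dates) depends on the model that is loaded.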
There is a spaCy function to visualize entities within the text. This is helpful if you want to post-process text in a document and have a visual representation of the data within the text.
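The visualizer in question is spaCy's displaCy module. A minimal sketch, assuming `displacy.render` with `style="ent"`; the helper name `entities_as_html` is hypothetical.

```python
from spacy import displacy


def entities_as_html(doc):
    """Render the entities of a processed Doc as a standalone HTML page."""
    # jupyter=False forces the markup to be returned as a string
    # instead of being displayed inline in a notebook.
    return displacy.render(doc, style="ent", page=True, jupyter=False)


# Usage sketch (assumes the en_core_web_sm package has been downloaded):
#
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   html = entities_as_html(nlp("Google paid $5 million to IBM in March."))
#   with open("entities.html", "w", encoding="utf-8") as handle:
#       handle.write(html)
```

Inside a Jupyter Notebook, calling `displacy.render(doc, style="ent", jupyter=True)` should show the highlighted text directly under the cell.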
This should give you a good idea of how an NLP API can be developed using a simple tool like the Flask web framework.
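To make the Flask idea concrete, here is one possible shape for such an API. It is a sketch under assumptions, not the implementation behind this story: the `/analyze` route, the JSON field names and the `create_app` factory are all hypothetical.

```python
from flask import Flask, jsonify, request
from spacy.language import Language


def create_app(nlp: Language) -> Flask:
    """Build a small Flask app that exposes a spaCy pipeline over HTTP."""
    app = Flask(__name__)

    # Hypothetical endpoint: POST {"text": "..."} and get the analysis back.
    @app.route("/analyze", methods=["POST"])
    def analyze():
        payload = request.get_json(silent=True) or {}
        text = payload.get("text", "")
        if not text:
            return jsonify({"error": "Provide a non-empty 'text' field."}), 400

        doc = nlp(text)
        return jsonify({
            "tokens": [
                {
                    "text": token.text,
                    "lemma": token.lemma_,
                    "pos": token.pos_,
                    "is_stop": token.is_stop,
                }
                for token in doc
            ],
            "entities": [
                {"text": ent.text, "label": ent.label_} for ent in doc.ents
            ],
        })

    return app


# Usage sketch (assumes the en_core_web_sm package has been downloaded):
#
#   import spacy
#   app = create_app(spacy.load("en_core_web_sm"))
#   app.run(port=5000)
```

Loading the pipeline once and passing it into the factory avoids reloading the model on every request, which would be slow.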
These are but a few functions; subsequent stories will look at intents and entities in more detail.