Fundamentals of Chatbot Information Extraction & Visualization
Maximize Your Information Extraction From User Input
Information extraction is the task of automatically extracting information from sources such as conversations or documents. The data can be structured or highly unstructured. Unstructured data usually means human language text, which is processed by means of Natural Language Processing (NLP). These conversations can also be embedded in audio, video, documents and live agent chat.
The vision is to extract information from unstructured data and, beyond that, to allow logical reasoning that derives inferences from the logical content of the unstructured data or conversations.
The bigger vision is to devise automatic methods to manage text. This goes beyond transmission, storage and display, to structuring the data and understanding the relationships between words, emotion, intent and meaning.
Tim Berners-Lee refers to the internet as a web of documents. Hence we have a growing amount of information, but in a highly unstructured format and in natural, human language.
Unstructured data is information that has no predefined data model and is not formally organized. In general, unstructured data is text heavy and contains entities like dates, numbers, and facts. Its irregularities and ambiguities make it hard to process with traditional software. The importance of this is underlined by a rule of thumb cited by Merrill Lynch in 1998: somewhere in the vicinity of 80–90% of all potentially usable business information may originate in unstructured form.
Unstructured data in the form of conversational text does have structure: we as humans can understand and interpret it as we read. We can instantly derive meaning, intent, relationships and entities from the text by glancing over it. The challenge for an automated process like a computer program is that the structure, or lack thereof, is unanticipated and unannounced. Often the human language in use is unannounced as well.
Hence we need to employ methods like NLP to find patterns and interpret the information.
We are imposing structure upon the unstructured data contained within the text; this means that there will be areas where our template or imposed structure will not capture or perfectly fit the underlying message.
The structure we try to impose needs to be represented, and visualizing it is powerful.
In the process of information extraction, named entities are real-world objects; nouns. These can include persons, locations, organizations, products and the like. A named entity can be denoted with a name, and it can be abstract or have a physical presence. Named entities can include the following: “Ronald Reagan”, “Amsterdam”, “Porsche”, “Microsoft”; really anything which can be named.
A Named Entity can also be seen as an entity instance; Amsterdam is an instance of a city. Porsche is an instance of a vehicle and so on.
The best way to introduce entities is with an example. Take the sentence “Madrid is a city in Spain”. The idea is that “Madrid” is a city here, and not the name of a person or any other entity that could be referred to as “Madrid”, because the sentence refers to it as a city in Spain.
Hence the target knowledge base depends on the intended application. The theme which emerges is the idea of linking entities to establish some kind of context and improve the performance of information retrieval systems.
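The linking idea can be illustrated with a minimal sketch. The knowledge base, its identifiers and the context cue words below are all hypothetical, and counting shared words is only a stand-in for what a real entity linker would do.

```python
# Hypothetical knowledge base: each mention maps to candidate entities,
# and each candidate lists context words that hint at that reading.
KNOWLEDGE_BASE = {
    "Madrid": [
        {"id": "city:madrid", "context": {"city", "spain", "capital"}},
        {"id": "person:madrid", "context": {"mr", "mrs", "said", "born"}},
    ],
}


def link_entity(mention, sentence):
    """Return the id of the candidate whose context words overlap most."""
    words = {w.strip(".,!?").lower() for w in sentence.split()}
    candidates = KNOWLEDGE_BASE.get(mention, [])
    if not candidates:
        return None  # mention is unknown to this knowledge base
    best = max(candidates, key=lambda c: len(c["context"] & words))
    return best["id"]


linked = link_entity("Madrid", "Madrid is a city in Spain")
print(linked)
```

Which candidates exist for a mention is decided entirely by the knowledge base, which is why the choice of knowledge base follows from the application.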
Impediments to Entity Linking
Variations in Name: take city names for instance…some NLP systems expect a city name to be a single word. Hence a name like “Cape Town” poses a problem.
There might be different spellings for the same entity and for different languages.
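One common way to cope with multi-word names and spelling variants is a lookup table of aliases combined with longest-match-first scanning. The sketch below assumes a tiny, hand-written alias table (hypothetical, for illustration only) and naive whitespace tokenization.

```python
# Hypothetical alias table: lower-cased surface forms -> canonical city name.
CITY_ALIASES = {
    "cape town": "Cape Town",
    "kaapstad": "Cape Town",  # Afrikaans/Dutch spelling of the same city
    "amsterdam": "Amsterdam",
}


def find_cities(text, max_words=3):
    """Scan left to right, always trying the longest word window first."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    found = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in CITY_ALIASES:
                found.append(CITY_ALIASES[candidate])
                i += n  # skip past the matched words
                break
        else:
            i += 1  # no city starts at this token
    return found


print(find_cities("I am flying from Cape Town to Amsterdam."))
```

Because the window shrinks from longest to shortest, “Cape Town” is matched as one entity rather than as two unrelated words.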
Ambiguity: this is also something we as humans struggle with…does the word “web” refer to a spider’s web or the internet? Does the word “kiosk” refer to a small shop or to a digital self-help station?
Then there are the challenges of multiple languages, of information which evolves and takes on different meanings and contexts, and of an analysis framework which cannot scale at speed.
Disambiguation refers to the removal of ambiguity by creating clarity. User input can be ambiguous and cryptic. The Conversational Interface or chatbot needs to disambiguate the user input, to collect the relevant informational pieces accurately.
However, it is important that the chatbot does not try to disambiguate user input which is not, in reality, cryptic or ambiguous. When it does, this is an example of the model not being strong enough to organize and create structure from the unstructured conversational data.
The result is a frustrated user who is forced to input data in a special format for the interface to consume. Structure is then created on the user’s side, and the beauty of the conversational interface is lost.
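A minimal sketch of this behaviour: resolve a word on its own when the context clearly favours one sense, and ask the user only when it does not. The sense inventory, cue words and margin parameter are hypothetical; a production chatbot would rely on a trained model rather than word overlap.

```python
# Hypothetical sense inventory: cue words that point towards each sense.
SENSES = {
    "kiosk": {
        "small shop": {"buy", "newspaper", "snack", "street"},
        "self-help station": {"screen", "print", "ticket", "scan"},
    },
}


def resolve(word, utterance, margin=1):
    """Pick a sense only if it beats the runner-up by at least `margin`."""
    words = {w.strip(".,!?").lower() for w in utterance.split()}
    scores = {sense: len(cues & words) for sense, cues in SENSES[word].items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best, best_score), (_, second_score) = ranked[0], ranked[1]
    if best_score - second_score >= margin:
        return {"sense": best, "ask_user": False}
    # Genuinely ambiguous: fall back to a clarifying question.
    return {"sense": None, "ask_user": True, "options": sorted(scores)}


print(resolve("kiosk", "I want to print my ticket at the kiosk"))
print(resolve("kiosk", "Where is the kiosk?"))
```

The first utterance carries enough context to be resolved silently; only the second one triggers a clarifying question.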
Visualization can be incredibly helpful in speeding up development and debugging your code and training process.
The dependency visualizer, dep, shows part-of-speech tags and syntactic dependencies.
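As a small illustration, the sketch below renders a dependency diagram with spaCy's displaCy. It assumes spaCy is installed and uses manual=True so that no trained model is needed; the tags and arcs are written by hand for this example and are not the output of a parser.

```python
from spacy import displacy

# Hand-written parse in displaCy's manual format (illustrative only).
parse = {
    "words": [
        {"text": "Madrid", "tag": "PROPN"},
        {"text": "is", "tag": "AUX"},
        {"text": "a", "tag": "DET"},
        {"text": "city", "tag": "NOUN"},
    ],
    "arcs": [
        {"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
        {"start": 2, "end": 3, "label": "det", "dir": "left"},
        {"start": 1, "end": 3, "label": "attr", "dir": "right"},
    ],
}

# jupyter=False forces the markup to be returned as a string.
svg = displacy.render(parse, style="dep", manual=True, jupyter=False)

with open("dependency.svg", "w", encoding="utf-8") as handle:
    handle.write(svg)
```

With a trained pipeline loaded, a parsed Doc can be passed to displacy.render directly instead of the hand-written dictionary.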
The entity visualizer, ent, highlights named entities and their labels in a text.
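The following sketch shows the entity visualizer on the earlier example sentence. It assumes spaCy is installed; the entity spans are marked by hand and passed with manual=True, so they are illustrative and not predictions from a model.

```python
from spacy import displacy

text = "Madrid is a city in Spain."


def mark(surface, label):
    """Build a character-offset entity for the first occurrence of `surface`."""
    start = text.find(surface)
    return {"start": start, "end": start + len(surface), "label": label}


example = {
    "text": text,
    "ents": [mark("Madrid", "GPE"), mark("Spain", "GPE")],
    "title": None,
}

# Returns HTML in which each entity is wrapped in a highlighted <mark>.
html = displacy.render(example, style="ent", manual=True, jupyter=False)
print(html)
```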
And some styling is possible…
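As one example of styling, displaCy's options argument can restrict which labels are highlighted and set a colour per label. The sketch below assumes spaCy is installed; the sentence, the hand-marked entities and the colour value are chosen purely for illustration.

```python
from spacy import displacy

text = "Microsoft opened an office in Amsterdam."


def mark(surface, label):
    """Build a character-offset entity for the first occurrence of `surface`."""
    start = text.find(surface)
    return {"start": start, "end": start + len(surface), "label": label}


example = {"text": text, "ents": [mark("Microsoft", "ORG"), mark("Amsterdam", "GPE")]}

# Highlight only ORG entities and give them a custom background colour.
options = {"ents": ["ORG"], "colors": {"ORG": "#ffd966"}}

styled_html = displacy.render(
    example, style="ent", manual=True, options=options, jupyter=False
)
print(styled_html)
```

Entities whose label is not listed under "ents" are still present in the text, but they are rendered as plain, unhighlighted words.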
Rendering several large documents on one page can easily become confusing. To add a headline to each visualization, you can add a title to its user_data. User data is never touched or modified by spaCy.
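A short sketch of setting such a headline. It assumes spaCy is installed and uses a blank English pipeline, which needs no trained model, with entities assigned by hand; the title text is arbitrary.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

# A blank pipeline only tokenizes, so the entities are set manually here.
nlp = spacy.blank("en")
doc = nlp("Madrid is a city in Spain.")
doc.ents = [Span(doc, 0, 1, label="GPE"), Span(doc, 5, 6, label="GPE")]

# displaCy reads the headline from the document's user_data.
doc.user_data["title"] = "Example sentence about Madrid"

titled_html = displacy.render(doc, style="ent", jupyter=False)
print(titled_html)
```

When several documents are rendered together, each one can carry its own title in this way.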