NVIDIA Jarvis Virtual Assistant With Rasa
And Why It Is More Accessible Than You Think
Introduction
Developing a virtual assistant with a speech interface requires four key elements. The first is speech recognition, also referred to as Automatic Speech Recognition (ASR) or Speech-To-Text: transcribing the user’s speech into text. This is the input touch point for the user.
The second is the output touch point: the conversion of text into speech, preferably natural-sounding speech. This is also referred to as Text-To-Speech (TTS) or Speech Synthesis.
These two elements need to have low latency, preferably less than 300 milliseconds. The underlying models also need to be trained.
The remaining two elements are shared with text-based conversational agents: dialog management and Natural Language Understanding (NLU). Rasa is the avant-garde when it comes to these two elements.
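To make the four elements concrete, the sketch below chains them for a single conversational turn and times each stage against the 300 millisecond target. It is only an illustration: the stage functions (speech_to_text, understand, decide_reply, text_to_speech) are hypothetical stand-ins rather than Jarvis or Rasa APIs, and audio is faked as a plain string.

```python
import time
from typing import Callable, Dict, List, Tuple

LATENCY_BUDGET_S = 0.3  # the roughly 300 ms target for the speech stages


# Hypothetical stand-ins. A real assistant would call an ASR service, an NLU
# model, a dialog manager and a TTS service here. Audio is faked as a string.
def speech_to_text(audio: str) -> str:
    return audio[len("audio:"):]


def understand(text: str) -> str:
    return "ask_weather" if "weather" in text.lower() else "small_talk"


def decide_reply(intent: str) -> str:
    replies = {"ask_weather": "Which city?", "small_talk": "Hello!"}
    return replies[intent]


def text_to_speech(text: str) -> str:
    return "audio:" + text


STAGES: List[Tuple[str, Callable[[str], str]]] = [
    ("asr", speech_to_text),
    ("nlu", understand),
    ("dialog", decide_reply),
    ("tts", text_to_speech),
]


def handle_turn(audio_in: str) -> Tuple[str, Dict[str, float]]:
    """Pass one user utterance through all four stages, timing each one."""
    timings: Dict[str, float] = {}
    payload = audio_in
    for name, stage in STAGES:
        start = time.perf_counter()
        payload = stage(payload)
        timings[name] = time.perf_counter() - start
    return payload, timings


def slow_speech_stages(timings: Dict[str, float]) -> List[str]:
    """Return the speech stages (ASR, TTS) that exceeded the latency budget."""
    return [s for s in ("asr", "tts") if timings[s] > LATENCY_BUDGET_S]


if __name__ == "__main__":
    reply, timings = handle_turn("audio:What is the weather like?")
    print(reply, slow_speech_stages(timings))
```

Keeping the stages behind a uniform text-in, text-out signature is a design choice of this sketch. It makes it easy to swap one NLU stage for another, which mirrors the two configurations of the demo.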
The second configuration for the Jarvis & Rasa demo application is where the Natural Language Processing is performed by Jarvis.
Currently ASR, NLU and TTS models are available in NVIDIA Jarvis, trained on thousands of hours of speech data.
On the roadmap of Jarvis are other cognitive elements like computer vision. The vision component includes lip activity, gaze detection, gesture detection and more.
I first heard about this from Fjord Design & Innovation, where they referred to some of these elements as a phenomenon called face speed.
Face speed refers to the cues and hints we pick up from gestures, facial expressions and lip activity.
By incorporating these elements in their roadmap, Jarvis is poised to become a true conversational agent, taking cues from the speaker’s appearance.
What makes this collaboration between NVIDIA and Rasa so compelling is that it combines two technological environments that need each other as much as they complement each other.
This is an avenue to speech-enable a Rasa digital assistant.
Environment Setup
In the Medium article I wrote on getting started with your NVIDIA Jarvis environment, you will find a step-by-step guide to set up a virtual machine instance using AWS EC2. Cost is always a consideration if you are just experimenting, especially if you are charged in a weaker currency.
The EC2 instance can also be started and stopped in order to save on costs.
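Starting and stopping can be scripted. The sketch below assumes the third-party boto3 SDK is installed with AWS credentials configured. It relies on the EC2 client's start_instances and stop_instances calls, while the helper name set_instance_running and the instance ID are hypothetical placeholders.

```python
from typing import Any


def set_instance_running(ec2_client: Any, instance_id: str, running: bool) -> str:
    """Start or stop one EC2 instance and return the state AWS reports."""
    if running:
        response = ec2_client.start_instances(InstanceIds=[instance_id])
        change = response["StartingInstances"][0]
    else:
        response = ec2_client.stop_instances(InstanceIds=[instance_id])
        change = response["StoppingInstances"][0]
    return change["CurrentState"]["Name"]


def main() -> None:
    """Stop the experiment VM. Call this yourself; it is not run automatically."""
    import boto3  # third-party AWS SDK, assumed installed and configured

    ec2 = boto3.client("ec2")
    # Placeholder ID: replace it with the ID of your own Jarvis instance.
    print(set_instance_running(ec2, "i-0123456789abcdef0", running=False))
```

Stopping the instance pauses compute charges, but attached EBS storage is still billed, and the public IP address usually changes on the next start unless an Elastic IP is attached.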
SSH tunnels work wonders for accessing URLs on the VM, but latency is a problem when testing the conversational agent by voice.
Why Rasa?
Rasa is a complete chatbot framework for any implementation where the user input is not voice, that is, text input, which includes conversational components like buttons, links, etc.
It needs to be noted that from a Conversational AI perspective Rasa has all the features and elements required.
Elements contributing to Rasa being a good option for the NVIDIA Jarvis environment:
- Free to download and use.
- Contained and complete chatbot framework.
- Open architecture for integration.
- Install anywhere.
The additions Rasa requires to be speech-enabled are:
- Automatic Speech Recognition (aka Speech-To-Text)
- Speech Synthesis (aka Text-To-Speech)
I would be remiss not to mention that the NLP capability of Jarvis is significant, hence the two architectural approaches mentioned at the start. It need not be a choice between the NLU/NLP of Jarvis or Rasa. The two can be used in conjunction, complementing each other.
The basic sequence of events here shows how the power of Jarvis NLP and Rasa’s NLU capability can be leveraged, especially for longer input.
One last thought on why Rasa: Rasa is currently the only industrial-strength conversational framework that employs machine learning for its dialog management, which in most other systems is currently a state machine. With Rasa’s vision of deprecating intent classification and also the dialog (or bot script), the flexibility matches the vision of Jarvis.
Running The Demo
To run the demo and also validate your installation, follow the step-by-step instructions found here. There are two modes in which to run the conversational agent: one with Rasa NLU, and the other with Jarvis NLP.
The conversational agent is served on https://[jarvis chatbot server host IP]:5555/JarvisWeather, and looks like a slimmed-down version of what you see in the official demo videos.
Above is an example of some small talk with the conversational agent. The demo instructions provide guidelines for a test dialog.
To run the weather bot, be sure to add the Weather API key to your Jarvis configuration. I had trouble with the Rasa Weather action extracting the key, so I hard-coded it in the action.
(rasa) root@156ggcbd3bg9:/workspace/samples/rasa-chatbot/rasa-weatherbot/actions# vim weather.py
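As an alternative to hard-coding the key, one option is to read it from an environment variable inside the action. The sketch below is an assumption about how that could look, not the demo's actual weather.py: the variable name WEATHER_API_KEY and the helper load_weather_api_key are hypothetical.

```python
import os
from typing import Optional


def load_weather_api_key(
    env_var: str = "WEATHER_API_KEY",  # hypothetical variable name
    fallback: Optional[str] = None,
) -> str:
    """Return the API key from the environment, or a fallback, or fail loudly."""
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    if fallback:
        return fallback
    raise RuntimeError(f"Set {env_var} before starting the action server")
```

Exporting the variable in the shell that launches the action server keeps the key out of the source file, and the explicit error makes a missing key obvious at startup rather than at the first weather request.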
You will also need to set up the network configuration for the demo to work. There are two locations in the code base that have to be configured for inter-service communication:
rasa-chatbot/rasa-weatherbot/endpoints.yml
and…
rasa-chatbot/config.py
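When inter-service communication fails, it can help to confirm that each host and port named in these two files actually accepts connections. The following is a minimal sketch using only the Python standard library. The hosts and ports listed are placeholders, so substitute the values from your own endpoints.yml and config.py.

```python
import socket
from typing import Dict, Tuple


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(services: Dict[str, Tuple[str, int]]) -> Dict[str, bool]:
    """Check every named (host, port) pair and report which ones answer."""
    return {name: is_reachable(host, port) for name, (host, port) in services.items()}


if __name__ == "__main__":
    # Placeholder values. 5555 is the chatbot port used in this article;
    # 5055 is Rasa's usual action server default and may differ in the demo.
    services = {
        "chatbot web page": ("localhost", 5555),
        "rasa action server": ("localhost", 5055),
    }
    for name, ok in check_services(services).items():
        print(f"{name}: {'reachable' if ok else 'NOT reachable'}")
```

A successful TCP connection only shows that something is listening on the port. It does not prove that the service behind it is configured correctly.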
Accessing the conversational agent via a browser on my machine is enabled with an SSH tunnel set up to port 5555 on the EC2 instance.
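A tunnel of this kind can be sketched as follows. The helper only builds the ssh command, using the standard -i, -N and -L options. The user name, key file and IP address are assumptions for illustration and will differ for your instance.

```python
import shlex
from typing import List


def build_tunnel_command(
    remote_host: str,
    user: str = "ubuntu",              # assumed login user
    key_file: str = "jarvis-key.pem",  # hypothetical key file name
    port: int = 5555,
) -> List[str]:
    """Build an ssh command that forwards a local port to the same port on the VM."""
    return [
        "ssh",
        "-i", key_file,                    # private key for the instance
        "-N",                              # no remote command, tunnel only
        "-L", f"{port}:localhost:{port}",  # local port -> same port on the VM
        f"{user}@{remote_host}",
    ]


if __name__ == "__main__":
    # 203.0.113.10 is a documentation-only address; use your instance's public IP.
    command = build_tunnel_command("203.0.113.10")
    print(shlex.join(command))
    # To open the tunnel for real: subprocess.run(command, check=True)
```

With the tunnel open, the browser on the local machine would point at port 5555 on localhost instead of the server's IP address.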
Conclusion
NVIDIA Jarvis has an ambitious roadmap to become an embedded voice assistant with speech and visual capabilities. A medium like a phone call will not do justice to the abilities of Jarvis; it is better suited to being embedded in an application on a phone, smart device or smart home with audio and vision.
As mentioned, the Jarvis NLP capabilities are astute and the state management can be facilitated within Jarvis. Integration with existing text-based digital assistants will stand Jarvis in good stead.