
How Might NVIDIA Jarvis Dialog Management Look?

And What Approach Will Play To The Strengths Of Jarvis?

Cobus Greyling
10 min read · Mar 9, 2021


Introduction

Recently NVIDIA released Jarvis, which is described as an application framework for Multimodal Conversational AI.

Conversational AI Skills

Jarvis is a high-performance conversational AI solution incorporating speech and visual cues, often referred to as face-speed. Face-speed includes gaze detection, lip activity and the like.

The multimodal aspect of Jarvis is best understood in the context of where NVIDIA wants to take Jarvis in terms of functionality.

Future functionality includes:

  • ASR (Automatic Speech Recognition) / STT (Speech To Text)
  • NLU (Natural Language Understanding)
  • Gesture Recognition
  • Lip Activity Detection
  • Object Detection
  • Gaze Detection
  • Sentiment Detection

Again, what is exciting about this collection of functionality is that Jarvis is poised to become a true Conversational Agent. But why do I say this?

NVIDIA Jarvis Weather demo chatbot deployed on an AMI and accessed via an SSH tunnel

We as humans communicate not only in voice, but also by detecting the gaze of the speaker, lip activity, facial expressions etc.

Another key focus area of Jarvis is transfer learning. There are significant cost savings in taking the advanced base models of Jarvis and repurposing them for specific uses.

The functionality which is currently available in Jarvis 1.0 Beta includes:

  • Automatic Speech Recognition (ASR/STT)
  • Speech Synthesis (TTS)
  • Natural Language Processing & Understanding (NLU)

Jarvis Dialog Management

Going through the documentation and demo application, there is no doubt about the ASR, TTS and NLU capability of Jarvis, especially considering elements like transfer learning and speed of training.

NVIDIA Jarvis Services & Models

However, one element currently not clearly defined is how the Dialog Manager (DM) will be implemented and how the dialog development interface will work.

As of yet, the NVIDIA Jarvis Dialog Development and Management feature is still under development and has not been released.

The Virtual Assistant (with Rasa) sample app (https://docs.nvidia.com/deeplearning/jarvis/user-guide/docs/samples/rasa.html) shows how Jarvis ASR, TTS, & NLU can be used with the Rasa Dialog Manager.

Dialog Management, also referred to as state management or conversation flow management, will be different for Jarvis than for other conversational environments.

The reason for this is that Jarvis is a multimodal conversational interface which will not only have speech and text input, but also speaker identification, lip activity and gaze detection. The latter is also known as look-to-talk: detecting whether a user is addressing Jarvis, and not a passenger, for example.

Hence multimodal refers to multi-user, multi-context and multi-medium input. This implies that the Jarvis DM environment will have to be more complex than those of traditional text and voice conversational interfaces.

Looking at the Jarvis tools currently available, the dialog management environment will most probably not be graphical, but rather follow a text-based configuration approach, such as YAML files. Absorbing the multimodal nature of Jarvis into a simple design and development structure will be a challenge.
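To make this concrete, a multimodal dialog definition might hypothetically look something like the sketch below. To be clear, this is pure speculation on my part: the node names, keys and modality conditions are invented for illustration and are not part of any released Jarvis API.

dialog:
  - node: greet_user
    conditions:
      speech_intent: greet
      gaze: directed_at_device    # look-to-talk: the user is addressing the device
      lip_activity: true          # the audio is attributable to this speaker
    action: utter_greet
  - node: ignore_side_conversation
    conditions:
      gaze: averted               # speech directed at a passenger, not at Jarvis
    action: no_response

The point of the sketch is that every dialog turn is gated not only on an NLU intent, but on visual conditions as well.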

As mentioned, the multimodality of Jarvis will necessitate a more complex tool to match the input demands.

Dialog Management Environment

As a reference, let’s take a look at how the current conversational agent landscape looks in terms of Dialog Development and Management…

(At this stage, you might want to skip forward to the conclusion.) 🙂

Current Dialog Development & Management Landscape

When different vendors and platforms converge on the same basic approach & principles, it is safe to assume it is the most efficient way of doing it.

Looking at the large and emerging chatbot platforms, they all converge on two key aspects:

  • Intents
  • Entities

There are elements like form or slot filling, policies etc. which are also important, but for the purpose of this story we will not focus on them.

Intents

Intents are usually defined by a description and a few training or utterance examples.

Rasa Chatbot Framework: intent example with contextual entities defined.

The utterance examples are what a user is anticipated to say or state when expressing the intent they expect to have fulfilled.

Some environments have additional features like intent conflict identification and utterance suggestions.
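As a concrete reference, in Rasa's YAML training-data format an intent is defined with a name and a list of example utterances. The intent name and utterances below are my own illustrative choices:

nlu:
- intent: book_restaurant
  examples: |
    - I want to book a table for tonight
    - reserve a table at an italian restaurant
    - can you find me a place to eat nearby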

Entities

Entities are the nouns the user enters; often multiple entities are compounded in one user utterance.

Contextual Entity Annotation In IBM Watson Assistant

The challenge here is to extract the entities when they are first uttered by the user.

Most environments can extract compound entities as well as contextual entities, and both capabilities are being implemented by a growing number of chatbot platforms.

The option to contextually annotate entities is also on the rise. Often entities have a finite set of values which can be defined upfront. Then there are entities which cannot be represented by a finite list, like cities of the world, names or addresses.

These entity types have too many variations to be listed individually.
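Staying with Rasa's training-data format as an example, compound entities are annotated inline in the training utterances, several per utterance where needed. The entity names and values below are my own illustration:

nlu:
- intent: inform
  examples: |
    - find me a [cheap](price) restaurant in [rome](location)
    - I feel like [spanish](cuisine) food in [london](location)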

Dialog Creation: Development & Management

The area where there is a divergence rather than a convergence is that of Dialog Development and Management.

Meaning that for some, the jury is still out on which approach is ideal…

Dialog Creation

The dialog component has the responsibility of managing the state of the conversation, turn-by-turn. This is also referred to as the state machine, dialog flow (in generic terms and not Google’s platform), conversation design or dialog management.

It needs to be stated that some environments have a clear separation between their NLU component and the core or dialog components; Rasa, Microsoft and AWS fall into this category.

IBM Watson Assistant takes the opposite approach, where the two are merged.

Below I look at four distinct approaches currently followed in the chatbot marketplace to dialog creation and management.

1. Design Canvas

A design canvas is part and parcel of the new Botsociety design environment.

Botsociety Design Canvas

In the Botsociety environment designs can be deployed to solutions like Rasa, Microsoft Bot Framework and more.

It is evident that Botsociety is becoming more technical and complex in nature.

Clearly the product is in the process of morphing from a design-, presentation- and prototype-only tool into a conversation development tool.

You can choose to export your design to:

  • Bot Framework / Azure
  • Dialogflow
  • Rasa.ai
  • Your own codebase (API)

Dialogflow CX also leverages this canvas approach where you can map out a complex conversation and expand pages or conversational nodes.

Google Dialogflow CX: Dialog development interface where the conversation state is managed.

This approach makes it easy to kick off a chatbot project, and designers feel comfortable with it initially.

But as complexity grows, scaling and management are impacted.

Advantages of this approach are:

  • Ease of collaboration
  • Panning and viewing of the design
  • Zoom in and out to see more or less detail
  • Combining of the design and development process.
  • Suitable for quick prototyping & cocreation.

Disadvantages of this approach are:

  • Complexity of large implementations
  • Change management and impact assessments
  • Troubleshooting and identifying conversation break points.
  • Multiple conditions per dialog node which are impacted when parameters change.

2. Dialog Configuration

You might ask what is the difference between a design canvas and dialog configuration…

IBM Watson Assistant: Dialog Nodes

Dialog configuration is an approach where you don’t quite have a canvas to design on, but conversational nodes are defined graphically.

These dialog nodes are arranged in a linear fashion, and the development environment is more rigid and sequential.

Within each dialog node conditions are set, and the conversation can skip up or down within the sequence, which can lead to confusion.

IBM Watson Assistant follows this design principle.

For each dialog node, conditions can be set and certain outcomes defined. Dialogflow ES, together with Microsoft Bot Framework Composer, is also reminiscent of a dialog configuration approach. Microsoft Power Virtual Agents is likewise based on dialog configuration.

Microsoft Bot Framework Composer: Flow Design Interface

Advantages of this approach are:

  • Slightly more condensed presentation of the conversation
  • Restrictive nature prohibits impulsive changes.
  • More technical in nature with varying levels of configuration.
  • Suitable for quick prototyping.

Disadvantages of this approach are:

  • Difficult to present and to perform walk-throughs.
  • For larger conversations there is mounting complexity and cross-referencing.
  • Mindfulness of how parameter and settings changes will cascade.
  • Not suited as a conversation design tool.

3. Native Code

Native code makes the case for a highly flexible and scalable environment. Solutions which come to mind in this category are Amazon Lex and, to some extent, Alexa skills.

But especially the Microsoft Bot Framework runs on native code. The advantage here is that non-proprietary code can be used: in the case of the Microsoft Bot Framework, C# or Node.js. In the case of Lex or Alexa skills, the dialog logic lives in Lambda functions, which you will most probably write in Node.js or Python.

Native code affords you much more agility and flexibility, although there is a chasm between design and implementation: a shared understanding needs to be established, and cocreation is inhibited.

There are also solutions which use proprietary code for the dialog, such as Oracle with their BotML.

Advantages of this approach are:

  • Non-proprietary in terms of development language.
  • Flexible and accommodating to change in scaling (in principle)
  • No dedicated skills or platform-specific knowledge required.
  • Porting of code, or even re-use.

Disadvantages of this approach are:

  • Design and implementation are far removed from each other.
  • Design interpretation might be a challenge.
  • Another, most probably dedicated, design tool will be required.
  • The complexity of managing different permutations in the dialog still exists; it simply moves into the code.

4. ML Stories

Rasa finds itself alone in this category, which it invented and pioneered. They apply ML, and the framework calculates the probable next conversational node from a base of user stories.

stories:
- story: collect restaurant booking info  # name of the story - just for debugging
  steps:
  - intent: greet                         # user message with no entities
  - action: utter_ask_howcanhelp
  - intent: inform                        # user message with entities
    entities:
    - location: "rome"
    - price: "cheap"
  - action: utter_on_it                   # action that the bot should execute
  - action: utter_ask_cuisine
  - intent: inform
    entities:
    - cuisine: "spanish"
  - action: utter_ask_num_people

Stories example from: https://rasa.com/docs/rasa/stories 👆

Rasa’s approach seems quite counter-intuitive… instead of defining conditions and rules for each node, the chatbot is presented with real conversations. The chatbot then learns from these conversational sequences how to manage future conversations.

These different conversations, referred to as Rasa Stories, are the training data employed for creating the dialog management models.

Slots and forms can be incorporated… and the idea of CDD (Conversation-Driven Development) underpins continuous improvement of the models; a form definition is sketched below.

Rasa-X: Conversation-Driven Design Interface
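As an illustration of the forms mentioned above, a Rasa 2.x form that collects mandatory booking details is declared in the domain file roughly as follows (the form and slot names are my own example):

forms:
  restaurant_form:
    cuisine:                # slot to fill
      - type: from_entity   # fill it from an extracted entity
        entity: cuisine
    num_people:
      - type: from_entity
        entity: number

The form keeps prompting until all required slots are filled, which addresses the mandatory-data concern raised below.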

Advantages of this approach are:

  • Everyone knows the state machine needs to be deprecated; this achieves that.
  • Training time is reasonable.
  • No dedicated or specific hardware required.
  • No dedicated ML experts and data scientists required…AI for the masses.
  • Complexity is hidden and presented in a simple way.

Disadvantages of this approach are:

  • This approach may seem abstract and intangible to some.
  • Apprehension in instances where mandatory data needs to be collected, or where legislation dictates conditions. However, this is where Form Policies come into play.

Conclusion

Jarvis is bringing deep learning to the masses. Considering the processing power and optimization of Jarvis on NVIDIA GPUs based on the Turing or Volta architectures, a machine learning approach seems most plausible; hence a scenario where probable conversational paths are learned rather than hard-coded. Elements which must be incorporated in the dialog management environment are digression, disambiguation and negating fallback proliferation.

In this configuration Rasa is utilized for dialog management and NLU.

Dialog turns can be initiated with gestures or a gaze; in other words, non-definitive input. These non-speech, non-text cues constitute non-explicit input, which will pose a challenge and will definitely add complexity.

Transfer learning could be utilized for certain industries, and dialog state starter packs could be made available for banking, insurance, telecommunications, medical care and so on.

Jarvis will be a living service: living in the user’s environment and surfacing via different devices and environments (car, home, office, phone etc.). This leads to a phenomenon known as ambient orchestration, where patterns are detected in user behavior and specific information is surfaced at the right time, in the right place and on the right device. Hence these living services are orchestrated following the user’s touchpoints.

This, coupled with multimodal input, will lean heavily towards an ML approach.


Cobus Greyling

I explore and write about all things at the intersection of AI & language; LLMs/NLP/NLU, Chat/Voicebots, CCAI. www.cobusgreyling.com