Photo by fran hogan on Unsplash

How Might NVIDIA Jarvis Dialog Management Look?

And What Approach Will Play To The Strengths Of Jarvis?

Cobus Greyling
10 min read · Mar 9, 2021



Recently NVIDIA released Jarvis, which is described as an application framework for multimodal conversational AI.

Conversational AI Skills

Jarvis is a high-performance conversational AI solution that incorporates speech and visual cues, often referred to as face-speed. Face-speed includes gaze detection, lip activity, etc.

The multimodal aspect of Jarvis is best understood in the context of where NVIDIA wants to take Jarvis in terms of functionality.

Future functionality includes:

  • ASR (Automatic Speech Recognition)
  • STT (Speech To Text)
  • NLU (Natural Language Understanding)
  • Gesture Recognition
  • Lip Activity Detection
  • Object Detection
  • Gaze Detection
  • Sentiment Detection

What is exciting about this collection of functionality is that Jarvis is poised to become a true conversational agent. But why do I say this?

NVIDIA Jarvis Weather demo chatbot deployed on an AMI and accessed via SSH tunnel

We as humans communicate not only in voice, but also by detecting the gaze of the speaker, lip activity, facial expressions etc.

Another key focus area of Jarvis is transfer learning. There is a significant cost saving in taking the advanced base models of Jarvis and repurposing them for specific uses.

The functionality which is currently available in Jarvis 1.0 Beta includes:

  • Automatic Speech Recognition (ASR/STT),
  • Speech Synthesis (TTS) and
  • Natural Language Processing & Understanding (NLU)

Jarvis Dialog Management

Going through the documentation and demo application, there is no doubt about the ASR, TTS and NLU capability of Jarvis, especially considering elements like transfer learning and speed of training.

NVIDIA Jarvis Services & Models

However, one element currently not clearly defined is how the Dialog Manager (DM) will be implemented and how the dialog development interface will work.

So, the NVIDIA Jarvis Dialog Development and Management feature is still under development and has not been released yet.

The Virtual Assistant (with Rasa) sample app shows how Jarvis ASR, TTS and NLU can be used with the Rasa Dialog Manager.

Dialog Management, also referred to as state management or conversation flow management, will be different for Jarvis than for other conversational environments.

The reason for this is that Jarvis is a multimodal conversational interface which will not only have speech/text input, but also speaker identification, lip activity detection and gaze detection. Gaze detection is also known as look-to-talk: it indicates whether a speaker is addressing Jarvis and not, for example, a passenger.

Hence multimodal refers to multi-user and also multi-context and multi-medium input. This implies that the Jarvis DM environment will have to be more complex than traditional text and voice conversational interfaces.
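To make the multi-medium idea concrete, here is a minimal Python sketch of what a single multimodal dialog turn could look like to a dialog manager, with a look-to-talk gate on top. Jarvis has not published such a structure; the class, its fields and the gating rule are all assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MultimodalTurn:
    """One dialog turn as a hypothetical DM might receive it (all fields are assumptions)."""
    speaker_id: str                # from speaker identification
    transcript: Optional[str]      # ASR output, None if nothing was said
    gaze_on_device: bool           # gaze detection: is the speaker looking at the assistant?
    lips_moving: bool              # lip activity detection
    gesture: Optional[str] = None  # e.g. "wave", from gesture recognition


def is_addressed_to_assistant(turn: MultimodalTurn) -> bool:
    """Look-to-talk gate: accept a turn only if the speaker looks at the device
    and either speaks (with visible lip activity) or makes a gesture."""
    spoke = turn.transcript is not None and turn.lips_moving
    return turn.gaze_on_device and (spoke or turn.gesture is not None)


to_assistant = MultimodalTurn("driver", "what is the weather like", True, True)
to_passenger = MultimodalTurn("driver", "did you see the game", False, True)
wave_only = MultimodalTurn("driver", None, True, False, gesture="wave")

print(is_addressed_to_assistant(to_assistant))  # speech while looking at the device
print(is_addressed_to_assistant(to_passenger))  # speech while looking elsewhere
print(is_addressed_to_assistant(wave_only))     # gesture without speech
```

Even this toy version shows why the DM has more to do: the same transcript can be a valid turn or background chatter depending on the non-speech signals.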

Looking at the Jarvis tools currently available, the dialog management environment will most probably not be graphical, but rather a text configuration approach such as a YAML file. Absorbing the multimodal nature of Jarvis in a simple design and development structure will be a challenge.

As mentioned, the multimodality of Jarvis will necessitate a more complex tool to match the input demands.

Dialog Management Environment

As a reference, let’s take a look at how the current conversational agent landscape looks in terms of Dialog Development and Management…

(At this stage, you might want to skip forward to the conclusion.) 🙂

Current Dialog Development & Management Landscape

When different vendors and platforms converge on the same basic approach & principles, it is safe to assume it is the most efficient way of doing it.

Looking at the large and emerging chatbot platforms, they all converge on two key aspects:

  • Intents
  • Entities

There are elements like form or slot filling, policies, etc. which are also important, but for the purpose of this story we will not focus on them.


Intents are usually defined by a description and a few training or utterance examples.

Rasa Chatbot Framework: intent example with contextual entities defined.

The utterance examples are what a user is anticipated to say or state; they express the intent the user expects to have fulfilled.
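As a rough illustration of how a handful of utterance examples can define an intent, the sketch below matches a new utterance to the intent whose examples share the most words with it. This is a deliberately naive word-overlap (Jaccard) matcher, not how Rasa, Jarvis or any other platform trains its NLU model; the intent names and example utterances are made up.

```python
import re

# Hypothetical intents, each defined by a few utterance examples.
INTENT_EXAMPLES = {
    "check_weather": [
        "what is the weather like",
        "will it rain tomorrow",
        "weather forecast for today",
    ],
    "book_restaurant": [
        "book a table for two",
        "i want to reserve a restaurant",
        "find me a cheap restaurant",
    ],
}


def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z']+", text.lower()))


def classify(utterance):
    """Return (intent, score) for the example with the best Jaccard word overlap."""
    tokens = tokenize(utterance)
    best_intent, best_score = None, 0.0
    for intent, examples in INTENT_EXAMPLES.items():
        for example in examples:
            example_tokens = tokenize(example)
            union = tokens | example_tokens
            score = len(tokens & example_tokens) / len(union) if union else 0.0
            if score > best_score:
                best_intent, best_score = intent, score
    return best_intent, best_score


print(classify("is it going to rain tomorrow"))
print(classify("reserve a table"))
```

A production NLU model generalises far beyond shared words, but the contract is the same: examples in, an intent and a confidence score out.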

Some environments have additional features like intent conflict identification and utterance suggestions.


Entities are the nouns the user enters, often with multiple entities compounded in one user utterance.

Contextual Entity Annotation In IBM Watson Assistant

The challenge here is to extract the entities when they are first uttered by the user.

Most environments can extract compound entities and also contextual entities.

So, compound and contextual entities are being implemented by more chatbot platforms.

The option to contextually annotate entities is also on the rise. Often entities have a finite set of defined values. Then there are entities which cannot be represented by a finite list, like cities in the world, names or addresses.

These entity types have too many variations to be listed individually.
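The difference between the two kinds of entity can be sketched in a few lines of Python. A finite entity is resolved with a lookup list, while an open-ended entity is picked up from its context in the sentence. The cuisine list and the "capitalised words after in/to" rule are assumptions for illustration only; real platforms learn contextual entities from annotated examples rather than from a hand-written pattern.

```python
import re

# Finite entity: every allowed value (and its synonyms) can be enumerated.
CUISINES = {"spanish": "spanish", "tapas": "spanish", "italian": "italian", "pizza": "italian"}

# Open-ended entity: too many values to list, so context is used instead.
# Assumption: a city is the run of capitalised words following "in" or "to".
CITY_PATTERN = re.compile(r"\b(?:in|to)\s+([A-Z][a-z]+(?:\s[A-Z][a-z]+)*)")


def extract_entities(utterance):
    """Return a dict with any cuisine and city found in the utterance."""
    entities = {}
    for word in re.findall(r"[a-z]+", utterance.lower()):
        if word in CUISINES:
            entities["cuisine"] = CUISINES[word]  # normalise synonyms to one value
            break
    match = CITY_PATTERN.search(utterance)
    if match:
        entities["city"] = match.group(1)
    return entities


print(extract_entities("Find me a cheap tapas place in Cape Town"))
```

Both entities are extracted from a single compound utterance, which is the behaviour most platforms are converging on.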

Dialog Creation: Development & Management

The area where there is a divergence rather than a convergence is that of Dialog Development and Management.

In other words, the jury is still out on which approach is ideal…

Dialog Creation

The dialog component has the responsibility of managing the state of the conversation, turn-by-turn. This is also referred to as the state machine, dialog flow (in generic terms and not Google’s platform), conversation design or dialog management.

It needs to be stated that some environments have a clear separation between their NLU component and the core or dialog components. Rasa, Microsoft and AWS fall in this category.

IBM Watson Assistant has the opposite approach where the two are really merged.

Below I look at four distinct approaches currently followed in the chatbot marketplace to dialog creation and management.

1. Design Canvas

A design canvas environment is part and parcel of the new Botsociety design environment.

Botsociety Design Canvas

In the Botsociety environment designs can be deployed to solutions like Rasa, Microsoft Bot Framework and more.

It is evident that Botsociety is becoming more technical in nature and more complex.

Clearly the product is in the process of morphing from a design, presentation and prototype-only tool into a conversation development tool.

You can choose to export your design to:

  • Bot Framework / Azure
  • Dialogflow
  • Or to your own codebase (API)

Dialogflow CX also leverages this canvas approach where you can map out a complex conversation and expand pages or conversational nodes.

Google Dialogflow CX: Dialog development interface where the conversation state is managed.

This approach makes it easy to kick off a chatbot project, and designers feel comfortable initially.

But as complexity grows, scaling and management are impacted.

Advantages of this approach are:

  • Ease of collaboration
  • Panning and viewing of the design
  • Zoom in and out to see more or less detail
  • Combining of the design and development process.
  • Suitable for quick prototyping & cocreation.

Disadvantages of this approach are:

  • Complexity of large implementations
  • Change management and impact assessments
  • Troubleshooting and identifying conversation break points.
  • Multiple conditions per dialog node which are impacted when parameters change.

2. Dialog Configuration

You might ask what is the difference between a design canvas and dialog configuration…

IBM Watson Assistant: Dialog Nodes

Dialog configuration is an approach where you don’t quite have a canvas to design on, but conversational nodes are defined graphically.

These design nodes are arranged in a linear fashion, and the development environment is more rigid and sequential.

Within each dialog node conditions are set, and the conversation can skip up or down within the sequence, which can lead to confusion.

IBM Watson Assistant follows this design principle.
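The node-and-condition idea can be sketched as an ordered list that is evaluated from top to bottom, where the first node whose condition holds produces the response. The node names, conditions and responses below are hypothetical and only illustrate the principle; this is not Watson Assistant's actual configuration format.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Context = Dict[str, object]


@dataclass
class DialogNode:
    name: str
    condition: Callable[[str, Context], bool]  # (intent, context) -> should this node fire?
    response: str


# Nodes are evaluated top to bottom; the first node whose condition holds wins.
NODES: List[DialogNode] = [
    DialogNode(
        "ask_cuisine",
        lambda intent, ctx: intent == "book_restaurant" and "cuisine" not in ctx,
        "What cuisine would you like?",
    ),
    DialogNode(
        "confirm_booking",
        lambda intent, ctx: intent == "book_restaurant",
        "Booking a table now.",
    ),
    DialogNode("anything_else", lambda intent, ctx: True, "Sorry, I did not understand."),
]


def respond(intent: str, context: Context) -> str:
    """Return the response of the first matching node."""
    for node in NODES:
        if node.condition(intent, context):
            return node.response
    raise RuntimeError("no node matched")  # unreachable while a catch-all node exists


print(respond("book_restaurant", {}))
print(respond("book_restaurant", {"cuisine": "spanish"}))
print(respond("greet", {}))
```

Note how the order of the nodes carries meaning: swapping the first two nodes would silently change the behaviour, which is one source of the confusion mentioned above.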

For each dialog node, conditions can be set and certain outcomes defined. Dialogflow ES is also reminiscent of a dialog configuration approach, as is Microsoft Bot Framework Composer. Microsoft Power Virtual Agents is also based on a dialog configuration approach.

Microsoft Bot Framework Composer: Flow Design Interface

Advantages of this approach are:

  • Slightly more condensed presentation of the conversation
  • Restrictive nature prohibits impulsive changes.
  • More technical in nature with varying levels of configuration.
  • Suitable for quick prototyping.

Disadvantages of this approach are:

  • Difficult to present and perform walk-through
  • For larger conversations there is mounting complexity and cross-referencing.
  • Mindfulness of how parameter and settings changes will cascade.
  • Not suited as a conversation design tool.

3. Native Code

Native code makes the case for a highly flexible and scalable environment. Solutions which come to mind in this category are Amazon Lex and, to some extent, Alexa skills.

This applies especially to Microsoft Bot Framework running on native code. The advantage here is that non-proprietary code can be used. In the case of Microsoft Bot Framework, C# or Node.js can be used. In the case of Lex or Alexa skills, you will most probably write Lambda functions in Node.js or Python.

Native code affords you much more agility and flexibility, although there is a chasm between design and implementation. A shared understanding needs to be established, and cocreation is inhibited.
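To give a feel for the native code approach, here is a minimal sketch of a turn handler written in plain Python. It is not tied to the Bot Framework, Lex or Alexa SDKs; the function name, slot names and prompts are all hypothetical. The point is that the conversation state and every branch of the dialog live in ordinary code.

```python
from typing import Dict, Tuple

REQUIRED_SLOTS = ["location", "cuisine", "num_people"]
PROMPTS = {
    "location": "Where would you like to eat?",
    "cuisine": "What cuisine would you like?",
    "num_people": "For how many people?",
}


def handle_turn(state: Dict[str, str], intent: str,
                entities: Dict[str, str]) -> Tuple[Dict[str, str], str]:
    """Return the updated state and the bot's reply for one user turn."""
    if intent == "greet":
        return state, "Hello, how can I help?"
    if intent == "cancel":
        return {}, "No problem, I have cancelled the booking."
    if intent == "inform":
        # Merge newly supplied entities without mutating the caller's state.
        known = {k: v for k, v in entities.items() if k in REQUIRED_SLOTS}
        state = {**state, **known}
        for slot in REQUIRED_SLOTS:
            if slot not in state:
                return state, PROMPTS[slot]  # ask for the first missing slot
        return state, (f"Booking a {state['cuisine']} restaurant in "
                       f"{state['location']} for {state['num_people']}.")
    return state, "Sorry, I did not understand that."


state, reply = handle_turn({}, "inform", {"location": "rome"})
print(reply)
state, reply = handle_turn(state, "inform", {"cuisine": "spanish"})
print(reply)
state, reply = handle_turn(state, "inform", {"num_people": "4"})
print(reply)
```

Everything is explicit and portable, but nothing here resembles a design a conversation designer could review, which is exactly the chasm described above.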

There are also solutions which use proprietary code for the dialog, such as Oracle with their BotML.

Advantages of this approach are:

  • Non-proprietary in terms of development environment language.
  • Flexible and accommodating to change in scaling (in principle)
  • No dedicated, platform-specific skills or knowledge required.
  • Porting of code, or even re-use.

Disadvantages of this approach are:

  • Design and implementation are far removed from each other.
  • Design interpretation might be a challenge.
  • Another, most probably dedicated, design tool will be required.
  • The complexity of managing different permutations in the dialog still needs to exist; within the code.

4. ML Stories

Rasa finds itself alone in this category, which it invented and pioneered. Rasa applies ML, and the framework calculates the probable next conversational node based on user stories.

stories:
- story: collect restaurant booking info  # name of the story - just for debugging
  steps:
  - intent: greet                         # user message with no entities
  - action: utter_ask_howcanhelp
  - intent: inform                        # user message with entities
    entities:
    - location: "rome"
    - price: "cheap"
  - action: utter_on_it                   # action that the bot should execute
  - action: utter_ask_cuisine
  - intent: inform
    entities:
    - cuisine: "spanish"
  - action: utter_ask_num_people
Stories example from: # 👆

Rasa’s approach seems quite counterintuitive…instead of defining conditions and rules for each node, the chatbot is presented with real conversations. The chatbot then learns from these conversational sequences to manage future conversations.

These different conversations, referred to as Rasa Stories, are the training data employed for creating the dialog management models.
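The principle of learning the next step from example conversations can be sketched with a toy model that simply counts which action followed which recent events in the stories. This is emphatically not Rasa's implementation, which trains machine learning policies on the stories; the event encoding, the two-event history window and the fallback action are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# One story as a flat event sequence, mirroring the restaurant booking story.
STORIES = [
    [
        "intent:greet", "action:utter_ask_howcanhelp",
        "intent:inform", "action:utter_on_it", "action:utter_ask_cuisine",
        "intent:inform", "action:utter_ask_num_people",
    ],
]


def train(stories, max_history=2):
    """Count which action followed each history of 1..max_history events."""
    counts = defaultdict(Counter)
    for story in stories:
        for i, event in enumerate(story):
            if not event.startswith("action:"):
                continue
            for n in range(1, max_history + 1):
                if i - n >= 0:
                    counts[tuple(story[i - n:i])][event] += 1
    return counts


def predict_next_action(counts, history, max_history=2):
    """Prefer the longest matching history, backing off to shorter ones."""
    for n in range(max_history, 0, -1):
        key = tuple(history[-n:])
        if len(key) == n and key in counts:
            return counts[key].most_common(1)[0][0]
    return "action:fallback"  # hypothetical default when nothing matches


model = train(STORIES)
print(predict_next_action(model, ["intent:greet"]))
print(predict_next_action(model, ["action:utter_ask_cuisine", "intent:inform"]))
```

The same user intent (inform) leads to different bot actions depending on what came before it, and no condition or rule was written by hand to achieve that.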

Slots and forms can be incorporated…and the idea of CDD (Conversation-Driven Development) underpins continuous improvement of the models.

Rasa-X: Conversation-Driven Design Interface

Advantages of this approach are:

  • Everyone knows the state machine needs to be deprecated; this achieves that.
  • Training time is reasonable.
  • No dedicated or specific hardware required.
  • No dedicated ML experts and data scientists required…AI for the masses.
  • Complexity is hidden and presented in a simple way.

Disadvantages of this approach are:

  • This approach may seem abstract and intangible to some.
  • Apprehension in instances where mandatory data needs to be collected, or where legislation dictates conditions. However, here Form Policies come into play.


Conclusion

Jarvis is bringing deep learning to the masses. Considering the processing power and optimization of Jarvis on NVIDIA GPUs based on the Turing or Volta architecture, a machine learning approach to dialog management seems more plausible, hence a scenario where probable conversational paths are defined. Elements which must be incorporated in the dialog management environment are digression, disambiguation and the prevention of fallback proliferation.
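Disambiguation and the prevention of fallback proliferation can be sketched as a small decision function over the NLU confidence scores. The thresholds, the outcome labels and the idea of handing over after repeated fallbacks are assumptions chosen for illustration, not behaviour documented for Jarvis or Rasa.

```python
from typing import List, Tuple


def decide(ranked: List[Tuple[str, float]], fallback_count: int = 0,
           min_conf: float = 0.5, margin: float = 0.15,
           max_fallbacks: int = 2) -> Tuple[str, List[str]]:
    """ranked: (intent, confidence) pairs sorted by descending confidence.

    Returns an outcome label and the intents it applies to.
    """
    if not ranked or ranked[0][1] < min_conf:
        # Nothing understood: fall back, but hand over instead of looping forever.
        if fallback_count >= max_fallbacks:
            return "handover", []
        return "fallback", []
    top_intent, top_conf = ranked[0]
    if len(ranked) > 1 and top_conf - ranked[1][1] < margin:
        # Two intents are too close to call: ask the user which one was meant.
        return "disambiguate", [top_intent, ranked[1][0]]
    return "execute", [top_intent]


print(decide([("check_weather", 0.9), ("book_restaurant", 0.05)]))
print(decide([("check_weather", 0.55), ("check_traffic", 0.5)]))
print(decide([("check_weather", 0.3)], fallback_count=2))
```

With multimodal input the same decision would have to weigh more signals than a single confidence score, which is part of the added complexity.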

In this configuration Rasa is utilized for dialog management and NLU.

Dialog turns can also be initiated with a gesture or a gaze, in other words with non-definitive input.

These non-speech, non-text cues and other non-explicit input will pose a challenge and will definitely add complexity.

Transfer learning could be utilized for certain industries, and a dialog state starter pack could be made available for banking, insurance, telecommunications, medical care, etc.

Jarvis will be a living service, living in the user’s environment and surfacing via different devices and environments (car, home, office, phone, etc.). This leads to a phenomenon known as ambient orchestration, where patterns are detected in user behavior and specific information is surfaced at the right time, in the right place and on the right device. These living services are thus orchestrated by following the user’s touchpoints.

This, coupled with multimodal input, leans heavily towards an ML approach.


