NVIDIA Nemotron Nano 2 VL
NVIDIA Nemotron Nano 2 VL model, a open-sourced vision-language model (VLM) is trained on the most open & transparent datasets…
NVIDIA’s focus with Nemotron Nano 2 VL is:
- Exceptional throughput (cut down on inference latency)
- Precession document intelligence (higher accuracy)
- multi-image reasoning (for instance, comparing different invoices with each other)
- video understanding (explain the video content as it unfolds)
NVIDIA Nemotron is a family of open-sourced models, datasets and technologies that allowing developers to build agentic AI.
Nemotron Nano 2 VL is a 12B multimodal reasoning model optimised for throughput to address inference latency. This is required for input across text, images, tables and videos.
Built on NVIDIA’s robust backbone, Nemotron Nano 2 VL comes with the most permissive license for unrestricted innovation for developers.
Developers have access to fully open model weights and 11 million high-quality training samples on Hugging Face.
NVIDIA Nemotron Nano 2 VL model is compatible with & performs efficiently on the NVIDIA DGX Spark.
Before I get into the technical details, just to mention the phrase from NVIDIA “AI Systems of Models”…
As I have mentioned before, there is a shift from AI Agents to Agentic Workflows…
Where AI Agents are being replaced by the notion of Agentic Workflows that orchestrate multiple AI Agents within a structured workflow.
While traditional AI Agents were code-heavy and relied on a single Language Model as their backbone, Agentic Workflows allow tasks to be queued or executed in parallel, with agents themselves becoming less code-intensive by leveraging the in-built reasoning of Language Models.
Workflows also orchestrate multiple Language Models themselves to optimise performance, selecting the best model for efficiency, cost and purpose.
An increasing number of no-code Agentic Workflows are emerging, though there remains a place for lightweight code-based ones, as this article demonstrates.
Here are a number of practical Python examples…
The most basic application using the OpenAI SDK:
Querying a video…
text_prompt = "Describe this video in detail"
reasoning_mode = False
temperature = 0.0
max_tokens=4096
call_nemotron_nano_2_vl(image_urls=[], video_urls=['https://blogs.nvidia.com/wp-content/uploads/2023/04/nvidia-studio-itns-wk53-scene-in-omniverse-1280w.mp4'], text_prompt=text_prompt, reasoning_mode=reasoning_mode, temperature=temperature, max_tokens=max_tokens)With the output…
This video demonstrates a 3D model of a hut surrounded by
trees and snow. The hut is brown with a snow-covered roof.
The surrounding area is covered in snow, with scattered
logs and rocks. A person is standing outside the hut,
and there is a table with a pot on it.
The trees are tall and dark green, and the sky is overcast.
The video is shot from a top-down perspective, and the model
is rotating to show different angles.Image querying…get the dataset…
import pyarrow.parquet as pq
import matplotlib.pyplot as plt
import io
!wget https://huggingface.co/datasets/katanaml-org/invoices-donut-data-v1/resolve/main/data/test-00000-of-00001-56af6bd5ff7eb34d.parquet
table = pq.read_table("test-00000-of-00001-56af6bd5ff7eb34d.parquet")
df = table.to_pandas()And querying the images…
text_prompt = "Sum up all the totals across the reciepts."
reasoning_mode = True
temperature = 0.6
max_tokens=32000
call_nemotron_nano_2_vl([img1_url, img2_url, img3_url, img4_url], video_urls=[], text_prompt=text_prompt, reasoning_mode=reasoning_mode, temperature=temperature, max_tokens=max_tokens)And a more advanced query…
text_prompt = "Here are 4 invoices flagged as potential duplicates — are they actually the same document with minor layout differences?"
reasoning_mode = True
temperature = 0.6
max_tokens=32000
call_nemotron_nano_2_vl([img1_url, img2_url, img3_url, img4_url], video_urls=[], text_prompt=text_prompt, reasoning_mode=reasoning_mode, temperature=temperature, max_tokens=max_tokens)And the output…
Okay, let's tackle this problem. The user has provided four invoices and wants to know if they're duplicates with minor layout differences. First, I need to compare each invoice's details to see if they're the same.
Starting with Invoice 1: Invoice number 26343874, dated 07/17/2013. Seller is Smith Ltd, client is Herring-Floyd. Items include Xbox, cables, PS Slim, Game Boy Micro, and GameCube. Net worth is $3,172.99, VAT $317.30, total $3,490.29.
Invoice 2: Number 25698672, dated 08/01/2015. Seller Gordon Alvarado and Jones, client Manning-Ho. Items are Cisco books, a book, and another book. Net worth $115.45, VAT $11.55, total $127.00. These are completely different items and net values, so not duplicates.
Invoice 3: Number 72455228, dated 05/11/2021. Seller Potts, Reed and Miller, client Brady LLC. Items include PlayStation consoles and a Gameboy. Net worth $1,959.98, VAT $196.00, total $2,155.98. Different items and much higher total, so not duplicates.
Invoice 4: Number 90508672, dated 04/16/2020. Seller Stone-Waller, client Shaw Group. Items are sofas, chairs, tables, and a dining table. Net worth $6,411.35, VAT $641.14, total $7,052.49. These are furniture items, different net values.
Looking at all four, each has unique invoice numbers, dates, sellers, clients, items, and totals. The items listed are entirely different across all invoices. Even though they all have VAT percentages, the actual amounts vary. The structure of the invoices is similar (header, items, summary), but the content varies significantly. Therefore, they aren't duplicates; they're distinct invoices with different details.
The four invoices are **not duplicates**. Each has distinct invoice numbers, dates, sellers, clients, items, and financial totals. While they share a similar structure (header, itemized list, summary), the content (e.g., products, net worth, VAT amounts) varies significantly across all four. They represent separate transactions with unique details.Chief Evangelist @ Kore.ai | I’m passionate about exploring the intersection of AI and language. Language Models, AI Agents, Agentic Apps, Dev Frameworks & Data-Driven Tools shaping tomorrow.
