Nemotron Nano is NVIDIA’s latest open-sourced model, focused on multiple languages, multiple input modalities and high throughput to address inference latency. Reasoning can be toggled on or off to optimise for the workload at hand.

NVIDIA is moving beyond hardware to software ecosystem dominance

4 min read · Oct 30, 2025


There is something I have been noticing: a groundswell around NVIDIA’s software and models. In this post I will try to give language to it.

NVIDIA’s release of Nemotron-Nano-12B-v2-VL-FP8 and related models fits neatly into its broader push to open-source efficient and versatile models.

This includes specialised small language models (SLMs) that can be orchestrated in agentic workflows, all the while tightening the loop around NVIDIA’s hardware stack.

NVIDIA is explicitly framing SLMs as the backbone for scalable agentic systems, where you compose multiple lightweight specialist models (for example, one for vision-RAG, another for guardrails) rather than relying on a giant monolithic model.
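As a rough illustration of that composition, an orchestrator can route each step of a workflow to the right specialist. The model names and routing heuristic below are hypothetical placeholders; this is a minimal sketch, not NVIDIA’s actual orchestration stack:

# Minimal sketch of routing agent steps to specialist SLMs.
# Model names and the routing rule are hypothetical placeholders.
SPECIALISTS = {
    "vision": "nemotron-nano-vl",    # document/image understanding
    "guardrails": "guardrail-slm",   # policy and safety checks
    "general": "nemotron-nano",      # everything else
}

def route(task: dict) -> str:
    """Pick a specialist model for a single step in an agentic workflow."""
    if task.get("attachments"):        # images or PDFs attached
        return SPECIALISTS["vision"]
    if task.get("needs_moderation"):   # output must pass a guardrail check
        return SPECIALISTS["guardrails"]
    return SPECIALISTS["general"]

print(route({"attachments": ["invoice_01.png"]}))  # -> nemotron-nano-vl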

NVIDIA’s recent research paper and dev blogs emphasise this: SLMs are economical and technically suitable for agentic workflows because they match or beat larger models on tool-use and coding tasks, run edge-side without a cloud dependency, and enable faster iteration across organisations.

The Nano VL variant is tuned for exactly this kind of work, for example extracting invoice data from videos and images, or summarising multi-document comparisons, making it plug-and-play for agent orchestration.

In the example below, the invoices come from a Hugging Face dataset. Leveraging the SLM, the invoices can be uploaded and questions can be asked like:

Sum up all the totals across the receipts

or

Here are 4 invoices flagged as potential duplicates — are they actually the same document with minor layout differences?


This is an interesting example because the model is able to perform spatial reasoning, comparing multiple invoices (images) in real time.
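As a rough sketch of what such a query could look like in code, the snippet below sends one question plus several invoice images to an OpenAI-compatible chat endpoint. The base URL, model identifier and file names are assumptions; substitute whatever endpoint actually serves the Nemotron Nano VL model:

# Minimal sketch: one question over multiple invoice images via an
# OpenAI-compatible chat completions endpoint.
# The base_url, api_key, model ID and image paths are assumptions.
import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed endpoint
    api_key="YOUR_API_KEY",
)

def to_data_url(path: str) -> str:
    """Encode a local image as a base64 data URL."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

content = [{"type": "text",
            "text": "Sum up all the totals across the receipts."}]
for path in ["invoice_01.png", "invoice_02.png", "invoice_03.png"]:
    content.append({"type": "image_url",
                    "image_url": {"url": to_data_url(path)}})

resp = client.chat.completions.create(
    model="nvidia/nemotron-nano-12b-v2-vl",  # assumed model identifier
    messages=[{"role": "user", "content": content}],
)
print(resp.choices[0].message.content)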

Back to the hardware…

Open models like Nemotron lower the barrier for developers to experiment, but they’re optimised for NVIDIA hardware.

Their new DGX Spark workstation is a compact ARM64-based personal AI supercomputer for prototyping agents and models right on your desk.

It’s pitched as the entry point for researchers to build in NVIDIA’s environments, ensuring local work ports to enterprise deployments without friction.

No easy AMD or Intel swaps, but that’s the point, right?

NVIDIA is creating momentum for their hardware moat.

And I think the challenge for everyone else is that NVIDIA is the most advanced in its approach to model orchestration, continuous fine-tuning and the data flywheel that closes a real-time feedback loop.

For me, the biggest impediment to experimenting with NVIDIA was access to, and the cost of, hardware. This is changing with the Spark.

With it, NVIDIA is moving into consumer hardware.


Below is the full post on the NVIDIA data flywheel approach…

So once you have the hardware and the environment is ready, you have access to a wealth of models, notebooks and cookbooks. This is NVIDIA’s opportunity to capture the way people work and shape what counts as best practice.

The example below shows how a presentation in PDF format is uploaded, and highly contextual questions can be asked about its contents.

text_prompt = "How much did Data Center business grow in Q2 FY26?"

text_prompt = "Which business unit had the most growth Y/Y?"
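As a rough sketch of how a prompt like this could be run against the deck, the pages can be rasterised to images and passed to the model alongside the question. pdf2image is one common way to do the rasterisation; the file name and the query_model helper are hypothetical, with query_model standing in for a chat call like the invoice sketch above:

# Minimal sketch: rasterise a PDF deck to page images, then ask a
# contextual question over those pages. The PDF file name is a placeholder.
from pdf2image import convert_from_path  # requires poppler installed

def query_model(prompt: str, image_paths: list[str]) -> str:
    """Hypothetical helper: send the prompt plus page images to the VL
    model (see the invoice sketch above for one possible implementation)."""
    ...

text_prompt = "How much did Data Center business grow in Q2 FY26?"

pages = convert_from_path("earnings_deck.pdf", dpi=150)  # list of PIL images
page_paths = []
for i, page in enumerate(pages):
    path = f"page_{i:02d}.png"
    page.save(path)
    page_paths.append(path)

print(query_model(text_prompt, page_paths))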

In Short

Small Language Models (SLMs) will be orchestrated to perform specific tasks in agentic workflows.

SLMs will be fine-tuned on a regular cadence based on a data flywheel.

Usage data will be curated and used to optimise different aspects of the agentic workflow.

SLMs will be optimised to perform specific pinpointed tasks.

Laser focus will be placed on two aspects: accuracy of tool selection, and orchestration of multiple or parallel model calls to optimise inference latency (sketched below).
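As a rough illustration of the latency point, independent specialist calls can be fanned out concurrently so total latency approaches the slowest call rather than the sum of all calls. The call_slm helper below is a hypothetical stand-in for a real async model request:

# Minimal sketch: parallel SLM calls with asyncio.
# call_slm() is a hypothetical stand-in for an async request to a model.
import asyncio

async def call_slm(model: str, prompt: str) -> str:
    await asyncio.sleep(0.5)  # simulate network/inference latency
    return f"[{model}] answer to: {prompt}"

async def main() -> None:
    # Three independent calls run concurrently, finishing in ~0.5s
    # instead of ~1.5s if they ran one after another.
    results = await asyncio.gather(
        call_slm("extractor-slm", "Pull the total from invoice 1"),
        call_slm("extractor-slm", "Pull the total from invoice 2"),
        call_slm("guardrail-slm", "Check the draft reply for policy issues"),
    )
    for r in results:
        print(r)

asyncio.run(main())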


What is holding us back is compute.

I see the NVIDIA DGX Spark as the first step in enabling all of this: giving individuals the freedom to prototype, fine-tune, perform production-grade inference and build edge applications.


Written by Cobus Greyling

Chief Evangelist @ Kore.ai | I’m passionate about exploring the intersection of AI and language. Language Models, AI Agents, Agentic Apps, Dev Frameworks & Data-Driven Tools shaping tomorrow. www.cobusgreyling.com