
Retrieval-Augmented Reasoning with Lean Language Models

4 min read · Sep 23, 2025


NVIDIA has been vocal that the future of Agentic AI lies in orchestrated small language models (SLMs).

And in the continuous fine-tuning of those SLMs.

Then there is this study, which highlights crucial impediments to taking Generative AI systems live…

Among these they list the lack of performant smaller language models…and hence they provide a blueprint for fine-tuning an open-source language model to be performant.

They also address privacy concerns, and the fact that production implementations need to achieve model and data sovereignty.


Add to this a secure environment and all the due diligence demanded of an enterprise environment.

And lastly, adhering to these requirements brings along the challenges of resource-constrained environments.

And two key considerations from this study are:

  1. A highly performant implementation that is lean, and hence optimised in various ways.
  2. Optimising the performance of the model through fine-tuning, also making use of synthetic data.

Considering the image below, DeepSeek-R1 is used as a frontier model to generate reasoning traces over retrieved documents for the queries, which contribute to a training dataset.

The fine-tuning process then applies to a lean model (Qwen2.5-32B-Instruct) using that combined dataset. The synthetic queries themselves are generated by GPT-4o.
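As a rough sketch of how such a training dataset could be assembled (the paper's actual prompts and scripts are not shown here, so the model endpoints, prompt wording and JSONL field names below are assumptions):

```python
import json
from openai import OpenAI

# Assumption: both frontier models are reachable via OpenAI-compatible endpoints.
query_gen = OpenAI()                                              # GPT-4o for synthetic queries
reasoner = OpenAI(base_url="https://your-deepseek-endpoint/v1",   # DeepSeek-R1 for reasoning traces
                  api_key="...")

def synthetic_query(doc_text: str) -> str:
    """Ask GPT-4o to invent a question answerable only from the document."""
    resp = query_gen.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": "Write one question a user might ask, answerable only "
                              f"from this text:\n\n{doc_text}"}])
    return resp.choices[0].message.content.strip()

def reasoning_trace(question: str, chunks: list[str]) -> str:
    """Ask DeepSeek-R1 to reason step by step over the retrieved chunks and answer."""
    context = "\n\n".join(chunks)
    resp = reasoner.chat.completions.create(
        model="deepseek-r1",
        messages=[{"role": "user",
                   "content": f"Context:\n{context}\n\nQuestion: {question}\n"
                              "Think step by step over the context, then answer."}])
    return resp.choices[0].message.content

# One JSONL training record per (query, retrieved chunks, reasoning trace) triple.
with open("training_data.jsonl", "w") as f:
    for doc in ["...domain document text..."]:
        q = synthetic_query(doc)
        record = {"query": q, "context": [doc], "response": reasoning_trace(q, [doc])}
        f.write(json.dumps(record) + "\n")
```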


The fine-tuning of the Qwen2.5-32B-Instruct model (resulting in variants like t0-1.1-k5-32B) does not negate or replace the need for a retrieval framework (for example LangChain) or a vector database like Chroma.

Instead, it primarily optimizes the model’s reasoning capabilities over retrieved documents, leading to improvements in answer accuracy, consistency, and domain-specific performance.
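A minimal sketch of that flow, assuming the fine-tuned model is served behind a local OpenAI-compatible endpoint and a Chroma store has already been populated (package names vary across LangChain versions, and the embedding model shown is an assumption):

```python
from langchain_openai import ChatOpenAI
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

# Retrieval is unchanged: embeddings + Chroma still select the relevant chunks.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
retriever = Chroma(persist_directory="./chroma",
                   embedding_function=embeddings).as_retriever(search_kwargs={"k": 5})

# Only the generator changes: the fine-tuned lean model reasons over what was retrieved.
llm = ChatOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY",
                 model="t0-1.1-k5-32B")

question = "What are the symptoms of asthma?"
docs = retriever.invoke(question)
context = "\n\n".join(d.page_content for d in docs)
answer = llm.invoke(f"Context:\n{context}\n\nQuestion: {question}")
print(answer.content)
```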

How the system works in production

Fine-tuning the smaller (lean) model is a major part of the process described in the paper.

The approach focuses on taking a lightweight backbone like Qwen2.5-Instruct and optimising it through supervised fine-tuning on a curated dataset of ~2,000 synthetic queries, retrieved document chunks and reasoning traces generated by frontier models.

This distillation enhances the model’s reasoning capabilities over retrieved content in a RAG setup, leading to significant improvements in accuracy while keeping it deployable in privacy-focused, resource-constrained environments.

Without this step, the system would rely more on larger, external models, which defeats the goal of lean, local deployment.
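A minimal sketch of that supervised fine-tuning step with Hugging Face TRL, assuming the JSONL records carry query, context and response fields; the hyperparameters are illustrative rather than the paper's recipe, and API details vary across TRL versions:

```python
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

dataset = load_dataset("json", data_files="training_data.jsonl", split="train")

def to_text(example):
    # Flatten each record into a single training string for SFT.
    context = "\n\n".join(example["context"])
    return {"text": (f"Context:\n{context}\n\n"
                     f"Question: {example['query']}\n\n"
                     f"Answer: {example['response']}")}

dataset = dataset.map(to_text)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",   # lean backbone; the 32B variant needs multi-GPU
    train_dataset=dataset,
    args=SFTConfig(output_dir="qwen2.5-rag-sft",
                   dataset_text_field="text",
                   num_train_epochs=2,
                   per_device_train_batch_size=1,
                   gradient_accumulation_steps=8),
)
trainer.train()
```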

Example snapshot of the chat interface

Core Models and Frameworks

Language Models

Backbone

Fine-tuned Qwen2.5-Instruct variants (e.g., Qwen2.5-1.5B-Instruct, Qwen2.5-32B-Instruct from Hugging Face), optimised for reasoning-aware generation over retrieved documents.

Frontier Models for Synthetic Data

DeepSeek-R1 (and variants like DeepSeek-V3, DeepSeek-R1-Zero) used to generate reasoning traces and synthetic queries.

Other Integrated Models

Azure OpenAI models (e.g., GPT-4o, o3-mini, o1) and Gemma3 (e.g., 1B variant) for comparisons or data generation.

Retrieval Frameworks

LangChain

Handles retrievers, document chunking, and integration with vector stores for semantic similarity-based retrieval.
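A minimal chunking sketch with LangChain's text splitter; the chunk sizes and file names are assumptions:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

# Plain-text files produced by the Pandoc conversion step described below.
raw_pages = {"asthma.txt": open("asthma.txt").read()}
docs = [Document(page_content=chunk, metadata={"source": name})
        for name, text in raw_pages.items()
        for chunk in splitter.split_text(text)]
```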

Vector Databases/Stores

Chroma

Primary option for embedding storage and similarity search (supports dense retrieval via embeddings).
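A minimal Chroma sketch, embedding the chunks from the splitter above and running a similarity search; the embedding model is an assumption, since the article does not name one:

```python
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
store = Chroma.from_documents(docs, embeddings, persist_directory="./chroma")

# Dense similarity search over the indexed chunks.
hits = store.similarity_search("What are the symptoms of asthma?", k=5)
for hit in hits:
    print(hit.metadata["source"], hit.page_content[:80])
```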

Data Processing and Pipeline Tools

Data Handling

Pandoc converts domain-specific data (e.g., HTML from NHS pages) to plain text for indexing.

Custom scripts for scraping, converting data to datasets (e.g., convert_txt_conditions_to_dataset.py generates JSONL files), and synthetic query generation.
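A rough sketch of that conversion pipeline; the actual convert_txt_conditions_to_dataset.py script is not reproduced in the article, so the paths and field names below are assumptions:

```python
import json
import pathlib
import subprocess

# Pandoc strips the HTML markup and leaves plain text suitable for indexing.
for html in pathlib.Path("nhs_pages").glob("*.html"):
    txt = html.with_suffix(".txt")
    subprocess.run(["pandoc", str(html), "-f", "html", "-t", "plain", "-o", str(txt)],
                   check=True)

# Collect the plain-text pages into a JSONL dataset, one record per condition.
with open("conditions.jsonl", "w") as out:
    for txt in pathlib.Path("nhs_pages").glob("*.txt"):
        out.write(json.dumps({"condition": txt.stem, "text": txt.read_text()}) + "\n")
```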

Embedding & Retrieval

Dense Retriever based on embedding models (implicitly tied to Hugging Face for embeddings like those from Qwen2.5).

Summarisation-based compression applied to documents before indexing to fit context windows.
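A minimal sketch of summarisation-based compression before indexing, so that more material fits the lean model's context window; the summariser model and prompt are assumptions:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def compress(chunk: str) -> str:
    # Condense a chunk with an LLM while keeping the factual content.
    resp = client.chat.completions.create(
        model="Qwen/Qwen2.5-1.5B-Instruct",
        messages=[{"role": "user",
                   "content": "Summarise the following passage in three sentences, "
                              f"keeping all factual details:\n\n{chunk}"}])
    return resp.choices[0].message.content

# These summaries, rather than the raw chunks, are then embedded and indexed.
compressed = [compress(d.page_content) for d in docs]
```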

Serving & API Infrastructure

Web/API Frameworks

FastAPI powers the serving of vector stores, retrievers, and RAG endpoints (e.g., commands like t0-1 serve-vector-store and t0-1 serve-rag).
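A minimal FastAPI sketch of such a RAG endpoint, reusing the retriever and llm objects from the earlier sketch; the paper's own t0-1 serve-rag command wraps something along these lines, but this exact layout is an assumption:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    question: str

@app.post("/rag")
def rag(query: Query):
    docs = retriever.invoke(query.question)                     # vector-store retrieval
    context = "\n\n".join(d.page_content for d in docs)
    answer = llm.invoke(f"Context:\n{context}\n\nQuestion: {query.question}")
    return {"answer": answer.content,
            "sources": [d.metadata.get("source") for d in docs]}

# Run locally with: uvicorn app:app --port 8080
```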

Inference Servers

vLLM is used for efficient model serving, especially for larger models like the 32B variants (runs on specific ports for local deployment).
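vLLM exposes an OpenAI-compatible API, so a fine-tuned checkpoint can be served locally and queried like any hosted model; the model path and port below are assumptions:

```python
# Start the server first, e.g.:  vllm serve ./qwen2.5-rag-sft --port 8000
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="./qwen2.5-rag-sft",
    messages=[{"role": "user", "content": "What are the symptoms of asthma?"}])
print(resp.choices[0].message.content)
```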





Written by Cobus Greyling

I’m passionate about exploring the intersection of AI & language. www.cobusgreyling.com
