Retrieval-Augmented Reasoning with Lean Language Models
This study addresses the crucial challenges of production AI and offers a view of where the field is heading.
NVIDIA has been vocal that the future of Agentic AI lies in orchestrated small language models (SLMs) that are continuously fine-tuned.
This study highlights the crucial impediments to going live with Generative AI systems.
The authors list the lack of performant smaller language models as one of these impediments, and hence provide a blueprint for fine-tuning an open-source language model to be performant.
They also address privacy concerns, and the fact that production implementations need to achieve model and data sovereignty.
Add to this the need for a secure environment and all the due diligence an enterprise demands.
And lastly, adhering to these requirements brings along the challenges of resource-constrained environments.
Two key considerations from this study are:
- A highly performant implementation that is lean, and hence optimised in various ways.
- Optimising the performance of the model through fine-tuning, making use of synthetic data.
Considering the image below, DeepSeek-R1 is used as a frontier model to generate reasoning traces over the documents retrieved for each query, and these traces contribute to the training dataset.
The fine-tuning process is then applied to a lean model (Qwen2.5-32B-Instruct) using that combined dataset. The synthetic queries themselves are generated by GPT-4o.
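As a rough sketch of what one record in that training dataset might look like, assuming a simple JSONL layout of query, retrieved context and reasoning trace (the field names and helper functions below are illustrative placeholders, not the paper's actual prompts or scripts):

```python
import json

# Hypothetical helpers standing in for the three model calls in the pipeline:
# GPT-4o writing a synthetic query, the dense retriever returning top-k chunks,
# and DeepSeek-R1 producing a reasoning trace over those chunks.

def generate_synthetic_query(doc_text: str) -> str:
    # Placeholder for a GPT-4o call.
    return "What are the first-line treatments for this condition?"

def retrieve_top_k(query: str, k: int = 5) -> list[str]:
    # Placeholder for the dense retriever backed by the vector store.
    return [f"retrieved chunk {i}" for i in range(k)]

def generate_reasoning_trace(query: str, chunks: list[str]) -> str:
    # Placeholder for a DeepSeek-R1 call that reasons step by step over the chunks.
    return "Step-by-step reasoning over the retrieved chunks..."

def build_record(doc_text: str) -> dict:
    query = generate_synthetic_query(doc_text)
    chunks = retrieve_top_k(query, k=5)
    return {
        "query": query,
        "context": chunks,
        "reasoning": generate_reasoning_trace(query, chunks),
    }

# Roughly 2,000 such records make up the fine-tuning dataset.
with open("training_data.jsonl", "w") as f:
    for doc in ["plain text of a source document"]:
        f.write(json.dumps(build_record(doc)) + "\n")
```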
The fine-tuning of the Qwen2.5-32B-Instruct model (resulting in variants like t0-1.1-k5-32B) does not negate or replace the need for a retrieval framework (for example LangChain) or a vector database like Chroma.
Instead, it primarily optimizes the model’s reasoning capabilities over retrieved documents, leading to improvements in answer accuracy, consistency, and domain-specific performance.
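To make the division of labour concrete, here is a hedged sketch of the query path: LangChain and Chroma still handle retrieval, and the fine-tuned model (served locally behind an OpenAI-compatible endpoint, as vLLM provides) reasons over the retrieved chunks. The collection name, embedding model, port and prompt wording are assumptions, not the paper's exact configuration.

```python
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from openai import OpenAI

# Assumes documents have already been embedded into a local Chroma collection.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector_store = Chroma(
    collection_name="nhs_conditions",
    embedding_function=embeddings,
    persist_directory="./chroma_db",
)

# OpenAI-compatible client pointed at a locally served fine-tuned model (e.g. via vLLM).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def answer(query: str, k: int = 5) -> str:
    docs = vector_store.similarity_search(query, k=k)
    context = "\n\n".join(d.page_content for d in docs)
    response = client.chat.completions.create(
        model="t0-1.1-k5-32B",  # fine-tuned Qwen2.5-32B-Instruct variant
        messages=[
            {"role": "system", "content": "Reason step by step over the provided context, then answer."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content

print(answer("What are the common symptoms of asthma?"))
```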
Fine-tuning the smaller (lean) model is a major part of the process described in the paper.
The approach focuses on taking a lightweight backbone like Qwen2.5-Instruct and optimising it through supervised fine-tuning on a curated dataset of ~2,000 synthetic queries, retrieved document chunks and reasoning traces generated by frontier models.
This distillation enhances the model’s reasoning capabilities over retrieved content in a RAG setup, leading to significant improvements in accuracy while keeping it deployable in privacy-focused, resource-constrained environments.
Without this step, the system would rely more on larger, external models, which defeats the goal of lean, local deployment.
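For illustration only, a supervised fine-tuning run over such a dataset could look roughly like the sketch below, here using Hugging Face TRL's SFTTrainer and the smaller Qwen2.5-1.5B-Instruct backbone to keep the example light; the prompt template and hyperparameters are placeholders rather than the paper's recipe.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# JSONL records of query, retrieved context and frontier-model reasoning trace,
# flattened into a single "text" field for causal language-model fine-tuning.
dataset = load_dataset("json", data_files="training_data.jsonl", split="train")

def to_text(example):
    context = "\n\n".join(example["context"])
    return {
        "text": (
            f"Question: {example['query']}\n\n"
            f"Context:\n{context}\n\n"
            f"Reasoning: {example['reasoning']}"
        )
    }

dataset = dataset.map(to_text)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    train_dataset=dataset,
    args=SFTConfig(output_dir="qwen2.5-1.5b-rag-reasoner", num_train_epochs=1),
)
trainer.train()
```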
Core Models and Frameworks
Language Models
Backbone
Fine-tuned Qwen2.5-Instruct variants (e.g., Qwen2.5-1.5B-Instruct, Qwen2.5-32B-Instruct from Hugging Face), optimised for reasoning-aware generation over retrieved documents.
Frontier Models for Synthetic Data
DeepSeek-R1 (and variants like DeepSeek-V3, DeepSeek-R1-Zero) used to generate reasoning traces and synthetic queries.
Other Integrated Models
Azure OpenAI models (e.g., GPT-4o, o3-mini, o1) and Gemma3 (e.g., 1B variant) for comparisons or data generation.
Retrieval Frameworks
LangChain
Handles retrievers, document chunking, and integration with vector stores for semantic similarity-based retrieval.
Vector Databases/Stores
Chroma
Primary option for embedding storage and similarity search (supports dense retrieval via embeddings).
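As a minimal illustration of how the processed documents might be chunked, embedded and stored in Chroma through LangChain (the embedding model and chunking parameters below are assumptions, not the paper's exact settings):

```python
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split plain-text documents into overlapping chunks before embedding them.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.create_documents(["...plain text of a converted NHS condition page..."])

# Embed the chunks and persist them in a local Chroma collection.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector_store = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    collection_name="nhs_conditions",
    persist_directory="./chroma_db",
)

# Dense, similarity-based retrieval over the stored embeddings.
retriever = vector_store.as_retriever(search_kwargs={"k": 5})
print(retriever.invoke("How is asthma diagnosed?"))
```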
Data Processing and Pipeline Tools
Data Handling
Pandoc converts domain-specific data (e.g., HTML from NHS pages) to plain text for indexing.
Custom scripts for scraping, converting data to datasets (e.g., convert_txt_conditions_to_dataset.py generates JSONL files), and synthetic query generation.
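A rough sketch of this data-handling step, with Pandoc invoked from Python and the resulting text files written out as a JSONL dataset; the paths and field names are illustrative, and the paper's own convert_txt_conditions_to_dataset.py script is not reproduced here.

```python
import json
import subprocess
from pathlib import Path

def html_to_text(html_path: Path, txt_path: Path) -> None:
    # Pandoc converts a scraped HTML page to plain text.
    subprocess.run(
        ["pandoc", "-f", "html", "-t", "plain", str(html_path), "-o", str(txt_path)],
        check=True,
    )

def txt_dir_to_jsonl(txt_dir: Path, out_file: Path) -> None:
    # One JSONL record per converted text file, keyed by its file name.
    with out_file.open("w") as f:
        for txt_path in sorted(txt_dir.glob("*.txt")):
            record = {"condition": txt_path.stem, "text": txt_path.read_text()}
            f.write(json.dumps(record) + "\n")

if __name__ == "__main__":
    html_to_text(Path("conditions/asthma.html"), Path("conditions_txt/asthma.txt"))
    txt_dir_to_jsonl(Path("conditions_txt"), Path("conditions.jsonl"))
```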
Embedding & Retrieval
Dense Retriever based on embedding models (implicitly tied to Hugging Face for embeddings like those from Qwen2.5).
Summarisation-based compression is applied to documents before indexing so that retrieved content fits within context windows.
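As a hedged sketch of what summarisation-based compression might look like, with a locally served model condensing each document before it is embedded; the model name, endpoint and prompt are assumptions.

```python
from openai import OpenAI

# OpenAI-compatible client pointed at a locally served model (e.g. via vLLM).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def compress(document: str, max_words: int = 200) -> str:
    # Condense a long document so more retrieved content fits the context window.
    response = client.chat.completions.create(
        model="Qwen/Qwen2.5-1.5B-Instruct",
        messages=[
            {"role": "system",
             "content": f"Summarise the document in at most {max_words} words, keeping all key facts."},
            {"role": "user", "content": document},
        ],
    )
    return response.choices[0].message.content

# The compressed summaries, rather than the raw pages, are then embedded and indexed.
```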
Serving & API Infrastructure
Web/API Frameworks
FastAPI powers the serving of vector stores, retrievers, and RAG endpoints (e.g., commands like t0-1 serve-vector-store and t0-1 serve-rag).
Inference Servers
vLLM is used for efficient model serving, especially for larger models like the 32B variants (runs on specific ports for local deployment).
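Putting the serving pieces together, a minimal FastAPI endpoint that delegates generation to a vLLM-served fine-tuned model could look roughly like this; the route, port and model name are illustrative, retrieval is stubbed out, and the paper's own t0-1 CLI wraps this kind of setup rather than this exact code.

```python
from fastapi import FastAPI
from openai import OpenAI
from pydantic import BaseModel

app = FastAPI()

# OpenAI-compatible client pointed at the vLLM server hosting the fine-tuned 32B model.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

class Query(BaseModel):
    question: str

def retrieve_context(question: str) -> str:
    # Placeholder: in the real pipeline this queries the Chroma vector store.
    return "retrieved document chunks..."

@app.post("/rag")
def rag(query: Query) -> dict:
    context = retrieve_context(query.question)
    response = client.chat.completions.create(
        model="t0-1.1-k5-32B",
        messages=[
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {query.question}"},
        ],
    )
    return {"answer": response.choices[0].message.content}

# Run locally with: uvicorn rag_api:app --port 8080
```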
