At our AI Hub, during the latest TuesdAI session, we moved beyond prompt engineering and focused on something more structural: how to build a Retrieval-Augmented Generation (RAG) system properly – and why it still matters.
While online discussions frequently claim that “RAG is dead” due to expanding context windows in modern LLMs, practical engineering constraints tell a different story. Large context windows increase cost, add latency, and do not replace structured retrieval strategies. In production systems, retrieval is not optional – it is an architectural decision.
RAG in the Context of LLM Engineering
RAG is part of a broader discipline often referred to as LLM Engineering – the structured application of techniques that make large language models reliable, performant, and production-ready.
Core techniques include:
- Prompt engineering
- Retrieval-Augmented Generation
- Fine-tuning
- Agentic workflows
- Cost and performance optimization
- Guardrails and evaluation
Recent industry data (including insights from Gartner) suggests that a significant percentage of GenAI initiatives will be discontinued due to unclear business value. The pattern is similar to what we’ve seen in large-scale data initiatives over the last decade: collecting data or deploying AI without a clear objective results in cost without measurable return.
RAG directly addresses one key limitation of base LLMs: static knowledge. Instead of retraining a model (which is expensive and produces static updates), RAG retrieves relevant information at inference time and injects it into the prompt. This enables dynamic knowledge extension without modifying model weights.
Fine-tuning still has a place – particularly for output structuring, tone alignment, or domain-specific reasoning – but it is no longer the default approach for knowledge extension. Static fine-tuned knowledge becomes outdated quickly. Retrieval remains dynamic.
From Naive Retrieval to Vector-Based RAG
To illustrate the mechanics, the session started with a deliberately simplified implementation – a “brute force” RAG.
We constructed a mock company knowledge base consisting of structured Markdown documents:
- Employees
- Clients
- Contracts
- Announcements
- Company policies
In the first iteration, documents were loaded into memory as a dictionary keyed by identifiers (e.g., employee surnames). Query matching was implemented using simple string comparison. If a query contained a matching key, the corresponding document was appended to the prompt under an additional context section.
While simplistic, this approach demonstrated the core RAG principle:
- Retrieve relevant context.
- Enrich the prompt with that context.
- Constrain the model via system instructions (“If you don’t know, say so. Do not fabricate.”).
Even this naive retrieval layer improved reliability compared to direct prompting.
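The naive approach can be sketched in a few lines. This is an illustrative reconstruction, not the session code: the knowledge base entries, keys, and the prompt wording are made up, and the actual LLM call is omitted.

```python
# Minimal sketch of the "brute force" RAG described above.
# Keys and documents are illustrative placeholders.

knowledge_base = {
    "jansen": "## Employee: Jansen\nRole: Data Engineer\nTeam: Platform",
    "acme": "## Client: Acme Corp\nContract: active until 2026",
}

SYSTEM_PROMPT = (
    "Answer using only the provided context. "
    "If you don't know, say so. Do not fabricate."
)

def retrieve(query: str) -> list[str]:
    """Naive retrieval: return documents whose key appears in the query."""
    q = query.lower()
    return [doc for key, doc in knowledge_base.items() if key in q]

def build_prompt(query: str) -> str:
    """Enrich the user query with any retrieved context."""
    context = "\n\n".join(retrieve(query))
    if context:
        return f"{query}\n\nAdditional context:\n{context}"
    return query

prompt = build_prompt("What role does Jansen have?")
```

The enriched `prompt`, together with `SYSTEM_PROMPT`, would then be sent to the model; queries with no matching key pass through unchanged.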
The second iteration introduced a proper vector-based implementation using LangChain to orchestrate the components.
The architecture was as follows:
- Documents were loaded and enriched with metadata (document type).
- Documents were chunked (800-character chunks with overlap) to optimize retrieval granularity.
- Each chunk was embedded using an auto-encoding embedding model.
- Embeddings were stored in a vector database (Chroma in this example).
- Incoming queries were embedded into the same vector space.
- A similarity search retrieved semantically closest chunks.
- Retrieved context was appended to the prompt.
- An autoregressive LLM generated the final response.
The distinction between embedding models (auto-encoding) and generative models (autoregressive) is essential here. The embedding model maps text into a high-dimensional vector space (e.g., 1536 dimensions), where semantically similar content clusters together. The generative model predicts the next token sequence based on enriched context.
To validate retrieval quality, embeddings were projected into 2D and 3D space for visualization. While dimensionality reduction techniques abstract the full vector space, clustering behavior confirmed that semantically related document types grouped together – indicating that retrieval would operate on meaningful proximity rather than keyword matching.
Production Considerations: Cost, Safety, and Architecture
A functional RAG demo is not equivalent to a production AI system.
Production readiness requires:
- Guardrails to prevent unsafe or irrelevant outputs
- Content moderation
- IP protection
- Cost optimization strategies
- Performance tuning
Cost control is particularly important. Expanding context windows to millions of tokens is technically possible but economically inefficient. Retrieval minimizes token usage by injecting only relevant segments.
Additional optimization strategies discussed included:
- Response caching (bypassing LLM calls for repeated queries)
- Model cascading (small model first, larger model only if needed)
- Retrieval parameter tuning (chunk size, overlap, similarity threshold)
The key takeaway is architectural:
RAG is not a workaround – it is a system design pattern.
It allows dynamic knowledge extension without retraining, supports maintainability through modular abstraction layers (e.g., LLM wrappers), and enables cost-performance balancing.
The narrative that RAG is obsolete ignores operational constraints. In real-world environments – especially in outsourcing and enterprise contexts like Levi9 – structured retrieval remains one of the most practical and scalable approaches to integrating LLMs with proprietary knowledge.
We once again reinforced a core principle:
AI systems are not built with prompts alone.
They are engineered.
***This article is part of the AI9 series, where we walk the talk on AI innovation.***