At our AI Hub, during the latest TuesdAI session, we moved beyond prompt engineering and focused on something more structural: how to build a Retrieval-Augmented Generation (RAG) system properly – and why it still matters.
While online discussions frequently claim that “RAG is dead” due to expanding context windows in modern LLMs, practical engineering constraints tell a different story. Large context windows increase cost, add latency, and do not replace structured retrieval strategies. In production systems, retrieval is not optional – it is an architectural decision.
RAG in the Context of LLM Engineering
RAG is part of a broader discipline often referred to as LLM Engineering – the structured application of techniques that make large language models reliable, performant, and production-ready.
Core techniques include:
- Prompt engineering
- Retrieval-Augmented Generation
- Fine-tuning
- Agentic workflows
- Cost and performance optimization
- Guardrails and evaluation
Recent industry data (including insights from Gartner) suggests that a significant percentage of GenAI initiatives will be discontinued due to unclear business value. The pattern is similar to what we’ve seen in large-scale data initiatives over the last decade: collecting data or deploying AI without a clear objective results in cost without measurable return.
RAG directly addresses one key limitation of base LLMs: static knowledge. Instead of retraining a model (which is expensive and produces static updates), RAG retrieves relevant information at inference time and injects it into the prompt. This enables dynamic knowledge extension without modifying model weights.
Fine-tuning still has a place – particularly for output structuring, tone alignment, or domain-specific reasoning – but it is no longer the default approach for knowledge extension. Static fine-tuned knowledge becomes outdated quickly. Retrieval remains dynamic.
From Naive Retrieval to Vector-Based RAG
To illustrate the mechanics, the session started with a deliberately simplified implementation – a “brute force” RAG.
We constructed a mock company knowledge base consisting of structured Markdown documents:
- Employees
- Clients
- Contracts
- Announcements
- Company policies
In the first iteration, documents were loaded into memory as a dictionary keyed by identifiers (e.g., employee surnames). Query matching was implemented using simple string comparison. If a query contained a matching key, the corresponding document was appended to the prompt under an additional context section.
While simplistic, this approach demonstrated the core RAG principle:
- Retrieve relevant context.
- Enrich the prompt with that context.
- Constrain the model via system instructions (“If you don’t know, say so. Do not fabricate.”).
Even this naive retrieval layer improved reliability compared to direct prompting.
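The naive approach can be sketched in a few lines. This is an illustrative reconstruction, not the session code: the knowledge base entries, keys, and the prompt wording are made up, and the actual LLM call is omitted.

```python
# Minimal sketch of the "brute force" RAG described above.
# Keys and documents are illustrative placeholders.

knowledge_base = {
    "jansen": "## Employee: Jansen\nRole: Data Engineer\nTeam: Platform",
    "acme": "## Client: Acme Corp\nContract: active until 2026",
}

SYSTEM_PROMPT = (
    "Answer using only the provided context. "
    "If you don't know, say so. Do not fabricate."
)

def retrieve(query: str) -> list[str]:
    """Naive retrieval: return documents whose key appears in the query."""
    q = query.lower()
    return [doc for key, doc in knowledge_base.items() if key in q]

def build_prompt(query: str) -> str:
    """Enrich the user query with any retrieved context."""
    context = "\n\n".join(retrieve(query))
    if context:
        return f"{query}\n\nAdditional context:\n{context}"
    return query

prompt = build_prompt("What role does Jansen have?")
```

The enriched `prompt`, together with `SYSTEM_PROMPT`, would then be sent to the model; queries with no matching key pass through unchanged.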
The second iteration introduced a proper vector-based implementation using LangChain to orchestrate the components.
The architecture was as follows:
- Documents were loaded and enriched with metadata (document type).
- Documents were chunked (800-character chunks with overlap) to optimize retrieval granularity.
- Each chunk was embedded using an auto-encoding embedding model.
- Embeddings were stored in a vector database (Chroma in this example).
- Incoming queries were embedded into the same vector space.
- A similarity search retrieved semantically closest chunks.
- Retrieved context was appended to the prompt.
- An autoregressive LLM generated the final response.
The distinction between embedding models (auto-encoding) and generative models (autoregressive) is essential here. The embedding model maps text into a high-dimensional vector space (e.g., 1536 dimensions), where semantically similar content clusters together. The generative model predicts the next token sequence based on enriched context.
To validate retrieval quality, embeddings were projected into 2D and 3D space for visualization. While dimensionality reduction techniques abstract the full vector space, clustering behavior confirmed that semantically related document types grouped together – indicating that retrieval would operate on meaningful proximity rather than keyword matching.
Production Considerations: Cost, Safety, and Architecture
A functional RAG demo is not equivalent to a production AI system.
Production readiness requires:
- Guardrails to prevent unsafe or irrelevant outputs
- Content moderation
- IP protection
- Cost optimization strategies
- Performance tuning
Cost control is particularly important. Expanding context windows to millions of tokens is technically possible but economically inefficient. Retrieval minimizes token usage by injecting only relevant segments.
Additional optimization strategies discussed included:
- Response caching (bypassing LLM calls for repeated queries)
- Model cascading (small model first, larger model only if needed)
- Retrieval parameter tuning (chunk size, overlap, similarity threshold)
The key takeaway is architectural:
RAG is not a workaround – it is a system design pattern.
It allows dynamic knowledge extension without retraining, supports maintainability through modular abstraction layers (e.g., LLM wrappers), and enables cost-performance balancing.
The narrative that RAG is obsolete ignores operational constraints. In real-world environments – especially in outsourcing and enterprise contexts like Levi9 – structured retrieval remains one of the most practical and scalable approaches to integrating LLMs with proprietary knowledge.
We once again reinforced a core principle:
AI systems are not built with prompts alone.
They are engineered.
***This article is part of the AI9 series, where we walk the talk on AI innovation.***