RAG Architecture Explained: Building AI Applications That Know Your Business
AI & Automation | 2026-03-27 | Agentixly Team

Retrieval-Augmented Generation (RAG) explained for engineering teams. Learn RAG architecture, chunking strategies, embedding models, vector databases, and how Agentixly builds production RAG systems.

Large language models (LLMs) are remarkably capable - but they have a critical limitation that makes them difficult to use for most business applications out of the box: they only know what they were trained on.

An LLM trained on data through mid-2024 doesn't know about your Q4 product launch, your updated pricing structure, your latest customer support policies, or anything else that happened after its training cutoff. More fundamentally, it doesn't know anything about your specific business, your internal documentation, or your proprietary data - regardless of the training cutoff.

Retrieval-Augmented Generation (RAG) solves this problem. It's the technique that allows LLMs to accurately answer questions about information they were never trained on - by finding the relevant information at query time and providing it directly in the prompt.

At Agentixly, RAG is the foundation of most of the AI applications we build for clients: internal knowledge bases, customer support agents, document analysis systems, and AI-powered search. This guide explains how RAG works, how to architect it well, and what pitfalls to avoid.

What Is RAG? The Core Concept

RAG stands for Retrieval-Augmented Generation. The name describes exactly what it does:

  1. Retrieval - when a user asks a question, the system searches a knowledge base to find the most relevant information
  2. Augmented - the retrieved information is added to the prompt sent to the LLM
  3. Generation - the LLM generates an answer using both its training knowledge and the retrieved context

A simple example:

Without RAG:
User: "What is our refund policy for enterprise customers?"
LLM: "I don't have access to your specific refund policy. Generally speaking, enterprise software companies..."

With RAG:
The system retrieves the relevant section of your policy documentation, includes it in the prompt, and the LLM responds: "According to your current policy (last updated March 2026), enterprise customers are eligible for a pro-rated refund within 90 days of purchase..."

The response is accurate, specific, and current - even though the LLM has never seen your policy document in training.
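The whole retrieve-augment-generate loop fits in a few lines. In this sketch, `search` and `llm` are injected stand-ins for your retrieval layer and model client, not real APIs:

```python
def answer_with_rag(question: str, search, llm, k: int = 4) -> str:
    """Minimal RAG loop: retrieve, augment the prompt, generate."""
    # 1. Retrieval: find the k most relevant chunks for the question
    chunks = search(question, k=k)

    # 2. Augmentation: place the retrieved text into the prompt
    context = "\n\n".join(chunks)
    prompt = (
        "Answer using ONLY the context below. "
        "If it is insufficient, say so.\n\n"
        f"<context>\n{context}\n</context>\n\n"
        f"Question: {question}\nAnswer:"
    )

    # 3. Generation: the LLM answers from the augmented prompt
    return llm(prompt)
```

Everything that follows in this guide is about making `search` (and the indexing behind it) good enough that the right chunks reach the prompt.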

The RAG Architecture: All the Components

A production RAG system consists of several interconnected components. Let's walk through each.

1. The Knowledge Base / Document Store

The foundation of any RAG system is the collection of documents you want to make searchable. This can include:

  • Product documentation and help articles
  • Internal wikis and knowledge bases
  • Customer support tickets and resolutions
  • Legal and compliance documents
  • Sales playbooks and pricing sheets
  • Research papers and industry reports
  • Code repositories and technical documentation
  • Database records and structured data (via text serialization)

The document store is typically a combination of your existing content management system (where content is authored) and a vector database (where it's indexed for fast retrieval).

2. The Chunking Layer

LLMs have context window limits - they can only process a certain amount of text at once. You can't retrieve entire documents and stuff them into a prompt. You need to break documents into smaller chunks that:

  • Are small enough to fit multiple relevant chunks in a single prompt
  • Are large enough to contain self-sufficient, coherent information
  • Have meaningful overlap to avoid losing context at chunk boundaries

Chunking strategies:

Fixed-size chunking - split text every N characters or tokens. Simple but often breaks in the middle of sentences or paragraphs, losing semantic coherence.

Recursive character splitting - split at semantic boundaries (paragraphs, sentences) first, then fall back to character limits. A better default for most text.

Semantic chunking - use embedding similarity to identify natural topic breaks and split there. More computationally expensive but produces more coherent chunks.

Document-structure-aware chunking - use the document's own structure (headings, sections, bullet points) to define chunk boundaries. Best for structured documents like wikis or documentation.

Practical chunk size: 512–1024 tokens is a common starting point. Larger chunks provide more context; smaller chunks improve retrieval precision. Test both on your specific data.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # characters per chunk (default length_function is len; swap in a tokenizer for token counts)
    chunk_overlap=200,    # characters of overlap between consecutive chunks
    separators=["\n\n", "\n", ". ", " "]  # try these separators in order
)

chunks = splitter.split_text(document_text)

3. The Embedding Model

Embeddings are the magic that makes semantic search possible. An embedding model converts text into a numerical vector (typically 768–4096 dimensions) that captures semantic meaning - not just keywords.

Text with similar meaning produces similar vectors, even if the exact words differ. So a user asking "How do I cancel my subscription?" will retrieve documents about "account cancellation" and "ending your plan" - not just documents containing the exact words "cancel my subscription."

Choosing an embedding model:

OpenAI text-embedding-3-large - high quality, widely used, good balance of performance and cost. 3072 dimensions.

Voyage AI models - strong performance on retrieval benchmarks, particularly for code and technical content.

Cohere embed-v3 - competitive quality with flexible dimensionality options.

Open-source options (Sentence Transformers, BGE models) - free to run locally or on your own infrastructure. Useful for data privacy requirements.

Key insight: Use the same embedding model for both indexing documents and embedding queries. Mixing models produces incomparable vectors.
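"Similar meaning produces similar vectors" is concrete once you compute cosine similarity. The tiny 4-dimensional vectors below are toy stand-ins for real embedding output, chosen only to illustrate the comparison:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: ~1.0 = same direction (same meaning), ~0.0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" standing in for real model output
cancel_sub   = np.array([0.9, 0.1, 0.0, 0.2])
end_plan     = np.array([0.8, 0.2, 0.1, 0.3])   # similar meaning -> similar vector
pricing_page = np.array([0.1, 0.9, 0.8, 0.0])   # different topic -> distant vector

print(cosine_similarity(cancel_sub, end_plan))      # high, close to 1
print(cosine_similarity(cancel_sub, pricing_page))  # low
```

At retrieval time this is exactly the comparison the vector database performs, just across millions of stored chunk vectors instead of three.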

4. The Vector Database

Embedded chunks need to be stored somewhere that supports fast nearest-neighbor search - finding the chunks whose embedding vectors are most similar to the query vector.

Managed vector databases:

  • Pinecone - fully managed, production-ready, generous free tier
  • Weaviate Cloud - managed Weaviate, supports hybrid search (vector + keyword)
  • Qdrant Cloud - open-source core with managed cloud option

Self-hosted options:

  • pgvector - PostgreSQL extension. The best choice if you're already using PostgreSQL, since it avoids adding a separate service
  • Qdrant - high performance, easy to self-host
  • Chroma - lightweight, excellent for development

Recommendation: For most startups, pgvector is the right choice if you're on PostgreSQL, or Pinecone if you want a fully managed solution without database administration overhead.
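If you go the pgvector route, the setup is a few SQL statements. A sketch (table and column names are illustrative; the vector dimension must match your embedding model, e.g. text-embedding-3-large with its `dimensions` parameter set to 1536, since pgvector's HNSW index caps `vector` columns at 2000 dimensions):

```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id        bigserial PRIMARY KEY,
    content   text NOT NULL,
    metadata  jsonb DEFAULT '{}',
    embedding vector(1536)   -- must match your embedding model's output dimension
);

-- HNSW index for fast approximate nearest-neighbor search using cosine distance
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);
```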

5. The Retrieval Layer

At query time, the retrieval layer:

  1. Embeds the user's query with the same embedding model used for documents
  2. Finds the top-K most similar chunks in the vector database
  3. Optionally re-ranks the retrieved chunks for relevance
  4. Returns the best chunks for inclusion in the LLM prompt

Retrieval strategies:

Naive vector search - find the K chunks with highest cosine similarity to the query. Simple, fast, but purely semantic.

Hybrid search - combine vector search with keyword search (BM25). This catches cases where semantic search misses exact-match terms (product names, IDs, technical terms). Most production RAG systems use hybrid search.

Re-ranking - after initial retrieval, use a cross-encoder model to re-score retrieved chunks for relevance to the specific query. More expensive but significantly improves precision. Models like Cohere Rerank or cross-encoders from Hugging Face work well.

MMR (Maximal Marginal Relevance) - diversify retrieved results to avoid returning 5 chunks that say the same thing. Useful when your knowledge base has a lot of redundant content.
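MMR is simple enough to sketch directly: greedily pick the chunk that best balances relevance to the query against similarity to chunks already picked. This is a minimal NumPy version, assuming query and chunk embeddings are already computed:

```python
import numpy as np

def mmr(query_vec: np.ndarray, doc_vecs: list, k: int = 5, lam: float = 0.7) -> list:
    """Maximal Marginal Relevance: trade off query relevance (weight lam)
    against redundancy with already-selected chunks (weight 1 - lam)."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    relevance = [cos(query_vec, d) for d in doc_vecs]
    selected, remaining = [], list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        best = max(
            remaining,
            key=lambda i: lam * relevance[i]
            - (1 - lam) * max((cos(doc_vecs[i], doc_vecs[j]) for j in selected),
                              default=0.0),
        )
        selected.append(best)
        remaining.remove(best)
    return selected  # indices into doc_vecs, most useful first
```

Lower `lam` favors diversity; `lam = 1.0` degenerates to plain similarity ranking.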

# Hybrid search with pgvector (simplified)
async def hybrid_search(query: str, k: int = 5) -> list[Document]:
    query_embedding = embed(query)

    # Vector search
    vector_results = await db.fetch(
        """
        SELECT id, content, metadata,
               1 - (embedding <=> $1::vector) AS similarity
        FROM documents
        ORDER BY embedding <=> $1::vector  -- ascending distance; lets pgvector use its index
        LIMIT $2
        """,
        query_embedding, k * 2  # Fetch extra for re-ranking
    )

    # Keyword search (Postgres full-text search; ts_rank stands in for BM25 here)
    keyword_results = await db.fetch(
        """
        SELECT id, content, metadata,
               ts_rank(to_tsvector('english', content), query) AS rank
        FROM documents, to_tsquery('english', $1) query
        WHERE to_tsvector('english', content) @@ query
        ORDER BY rank DESC
        LIMIT $2
        """,
        format_query(query), k * 2
    )

    # Combine and re-rank
    combined = merge_results(vector_results, keyword_results)
    return rerank(query, combined)[:k]

6. The Generation Layer (The LLM)

The final component is the LLM that generates the answer from the retrieved context. The prompt typically follows this structure:

You are a helpful assistant for [Company Name].
Answer the user's question using ONLY the information provided
in the context below. If the context doesn't contain enough
information to answer the question, say so.

<context>
[Retrieved chunks go here]
</context>

User question: [Query]

Answer:

Critical prompt engineering decisions for RAG:

Instruction to use only the provided context - prevents the LLM from hallucinating information not in your knowledge base. Without this, the LLM will confidently blend retrieved information with training knowledge, making it impossible to trace which is which.

Citation instructions - instruct the LLM to cite which document or section each piece of information came from. This is critical for enterprise RAG systems where users need to verify information.

Uncertainty handling - explicitly tell the LLM what to do when the context doesn't contain enough information to answer.
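The three decisions above can be baked into a single prompt-assembly function. This is one possible shape, not a canonical template; the chunk dict keys ("source", "text") are assumptions for illustration:

```python
def build_rag_prompt(question: str, chunks: list, company: str) -> str:
    """Assemble a RAG prompt with the three guardrails above:
    context-only answers, per-chunk source tags for citation, and
    an explicit fallback when the context is insufficient."""
    context = "\n\n".join(
        f'<chunk source="{c["source"]}">\n{c["text"]}\n</chunk>' for c in chunks
    )
    return (
        f"You are a helpful assistant for {company}.\n"
        "Answer using ONLY the information in the context below, and cite\n"
        "the source of each claim in [brackets]. If the context does not\n"
        "contain enough information to answer, say so instead of guessing.\n\n"
        f"<context>\n{context}\n</context>\n\n"
        f"User question: {question}\n\nAnswer:"
    )
```

Tagging each chunk with its source in the prompt is what makes citation instructions actually enforceable: the model can only cite what it can see.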

Advanced RAG Patterns

Multi-Vector Retrieval (Parent-Child Chunking)

A powerful technique: index small "child" chunks for high-precision retrieval, but return the larger "parent" chunk in the prompt for full context.

This solves a common problem: small chunks are more precisely matched to queries, but they lack context. Large chunks have context but are imprecisely matched.

# Index small child chunks for retrieval (helper functions here are illustrative)
child_chunks = split_into_small_chunks(document, size=200)       # ~200-token children
index_with_parent_reference(child_chunks, parent_id=document.id)

# At retrieval time: match against child chunks, return their parents
matching_children = vector_search(query)
parent_ids = list({c.parent_id for c in matching_children})      # dedupe shared parents
parent_documents = fetch_parents(parent_ids)
return parent_documents  # Full parent documents go into the prompt

Query Transformation

User queries are often poorly formulated for retrieval. Query transformation improves retrieval quality by rewriting or expanding the query before searching.

HyDE (Hypothetical Document Embedding) - ask the LLM to generate a hypothetical answer to the query, then use that as the search query. This often produces better retrieval than using the original question directly.

Query expansion - generate multiple variations of the query and search with all of them, then deduplicate and re-rank results.

Decomposition - for complex multi-part questions, decompose into sub-questions, answer each, then synthesize a final answer.
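Of these, query expansion is the simplest to sketch. Here `llm` and `search` are injected stand-ins (the real versions would call your model and retrieval layer); the rewrite prompt and result format are assumptions for illustration:

```python
def expanded_search(query: str, llm, search, k: int = 5) -> list:
    """Query expansion: rewrite the query several ways, search with each
    variant, then deduplicate by chunk id, keeping first-seen order."""
    # Ask the LLM for alternative phrasings (one per line); keep the original too
    rewrites = llm(f"Rewrite this search query 2 different ways, one per line: {query}")
    variants = [query] + [v.strip() for v in rewrites.splitlines() if v.strip()]

    seen, merged = set(), []
    for variant in variants:
        for chunk in search(variant, k=k):
            if chunk["id"] not in seen:
                seen.add(chunk["id"])
                merged.append(chunk)
    return merged[:k]  # in production, re-rank the merged pool before truncating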

Agentic RAG

The most sophisticated RAG systems give an AI agent tools to perform iterative retrieval - searching, evaluating results, refining the search, and repeating until it has sufficient information to answer confidently.

This handles questions that can't be answered with a single retrieval step - questions requiring information from multiple documents or requiring inference across retrieved facts.

Evaluating RAG System Quality

RAG systems are notoriously difficult to evaluate because errors can occur at the retrieval stage, the generation stage, or both.

Key RAG Metrics

Retrieval Precision - of the chunks retrieved, what fraction are actually relevant to the query? Measure against a manually labeled evaluation set.

Retrieval Recall - of the relevant chunks that exist in the knowledge base, what fraction did the retrieval system find? A high-recall system is important when missing a relevant document could mean giving a wrong answer.

Answer Faithfulness - does the generated answer accurately reflect what the retrieved context says, without adding information not in the context? This is essentially hallucination detection.

Answer Relevance - does the generated answer actually address the user's question?

Context Relevance - are the retrieved chunks actually relevant to the question?

Use frameworks like RAGAS or LangSmith to automate evaluation across your test set.

Building an Evaluation Dataset

Create a "golden dataset" of question-answer pairs for your specific knowledge base:

  1. Collect 50–100 questions your system should be able to answer
  2. Have human experts write the correct answers
  3. Identify which documents contain the information needed for each answer
  4. Run your RAG system against this dataset
  5. Measure precision, recall, faithfulness, and relevance scores
  6. Use this dataset to evaluate changes before deployment
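The retrieval half of this evaluation reduces to set arithmetic. A minimal sketch, assuming each golden example carries a `question` and the expert-labeled set of `relevant_ids`, and `retrieve` is your retrieval function returning chunk ids:

```python
def retrieval_metrics(retrieved_ids: list, relevant_ids: set) -> dict:
    """Per-query retrieval precision and recall against golden labels."""
    hits = len(set(retrieved_ids) & relevant_ids)
    return {
        "precision": hits / len(retrieved_ids) if retrieved_ids else 0.0,
        "recall": hits / len(relevant_ids) if relevant_ids else 0.0,
    }

def evaluate(golden: list, retrieve) -> dict:
    """Average precision and recall over the golden dataset."""
    per_query = [
        retrieval_metrics(retrieve(ex["question"]), ex["relevant_ids"])
        for ex in golden
    ]
    n = len(per_query)
    return {m: sum(q[m] for q in per_query) / n for m in ("precision", "recall")}
```

Faithfulness and answer relevance still need an LLM-as-judge or a framework like RAGAS, but precision and recall alone will catch most retrieval regressions before deployment.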

Common RAG Pitfalls and How to Avoid Them

Poor chunking - chunks that break mid-sentence or mid-thought lose context and reduce retrieval quality. Use structure-aware chunking for structured documents.

Wrong chunk size - too small (< 100 tokens) loses context; too large (> 2000 tokens) reduces retrieval precision. Test on your specific data.

Single embedding model for all content types - code, legal documents, and customer support tickets have very different semantic properties. Consider domain-specific embedding models or separate indexes for different content types.

No re-ranking - naive vector search returns semantically similar chunks but not necessarily the most relevant ones. Add re-ranking to significantly improve answer quality.

No evaluation - deploying RAG without measuring retrieval quality means you're flying blind. Build the evaluation dataset before launch.

Ignoring document freshness - if your knowledge base is updated frequently, stale chunks in the index produce wrong answers. Implement incremental indexing so new and updated documents are re-indexed promptly.
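One simple way to implement incremental indexing is to track a content hash per document and re-embed only what changed. A sketch; the document dict shape and the `plan_reindex` name are assumptions:

```python
import hashlib

def plan_reindex(documents: list, indexed_hashes: dict) -> list:
    """Incremental indexing: return only documents whose content hash changed
    since the last run. indexed_hashes maps doc id -> previously stored hash."""
    to_reindex = []
    for doc in documents:
        digest = hashlib.sha256(doc["content"].encode()).hexdigest()
        if indexed_hashes.get(doc["id"]) != digest:
            to_reindex.append(doc)           # new or updated: chunk + embed again
            indexed_hashes[doc["id"]] = digest
    return to_reindex
```

Run this on a schedule (or from CMS webhooks) and unchanged documents cost nothing, while edits reach the index promptly.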

No citation in outputs - users of enterprise AI applications need to verify information. If your RAG system doesn't cite sources, trust is impossible to build.

How Agentixly Builds Production RAG Systems

At Agentixly, we've built RAG systems for use cases ranging from customer support automation to legal document analysis to internal knowledge bases for enterprises with 10,000+ documents.

Our RAG stack for production systems:

  • Embedding: OpenAI text-embedding-3-large or Voyage AI depending on content type
  • Storage: pgvector for most clients (leverages existing PostgreSQL); Pinecone for large-scale dedicated deployments
  • Retrieval: Hybrid vector + BM25 search with Cohere re-ranking
  • Chunking: Structure-aware chunking with parent-child retrieval for long documents
  • LLM: Claude for generation (superior instruction following and citation behavior)
  • Evaluation: RAGAS-based automated evaluation with golden dataset
  • Observability: LangSmith for tracing and monitoring

The result is a RAG system that delivers accurate, cited, verifiable answers - and continues to improve as we add more content and optimize based on production feedback.

If your company is exploring AI applications that need to reason over your proprietary data, Agentixly can help you design and build the right RAG architecture. Reach out to our team to get started.