You've built the demo. You chunked your documents, embedded them into Pinecone or Qdrant, wired up a retrieval step, passed the results to GPT-4, and it worked. It answered questions about your documents. You showed the client. Everyone was impressed.
Then you tried it on real documents, with real users, and it fell apart.
This happens on nearly every RAG project that skips the architectural decisions that actually matter. After building RAG systems for legal document review, financial report analysis, and customer support — each with different document types, retrieval requirements, and failure modes — here's what we've learned separates a working demo from a production system.
Why Naive RAG Fails in Production
Naive RAG is a four-step pipeline: split each document by a fixed token count, embed the chunks, retrieve the top-k by cosine similarity, and inject them into the prompt. It works in demos because demos use clean documents and handpicked questions. Production breaks it in three ways:
Chunk boundary problems. A fixed-size chunker doesn't respect semantic boundaries. You split a legal clause mid-sentence, embed both halves, and neither half retrieves well for queries about that clause. The answer exists in your corpus — your retrieval just can't find it.
Retrieval quality at scale. Cosine similarity over raw embeddings degrades as the corpus grows and query distribution widens. You start seeing irrelevant chunks retrieved with high similarity scores, especially for short or ambiguous queries.
Hallucination on gaps. When retrieval fails silently — when the relevant chunk doesn't surface — the LLM fills the gap with plausible-sounding fabrication. Users don't know the difference. In legal or financial contexts, this is a liability problem, not just a quality problem.
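To make the baseline concrete, here is a minimal sketch of the naive pipeline's chunking and retrieval steps. It assumes whitespace-separated words as a stand-in for tokens, and the function names (`fixed_size_chunks`, `cosine`, `top_k`) are illustrative, not from any library. The sample clause shows the boundary problem: the termination right and its notice period land in different chunks.

```python
import math
from typing import List, Sequence, Tuple


def fixed_size_chunks(words: Sequence[str], size: int) -> List[str]:
    """Cut a word list into consecutive windows of `size`, ignoring sentence structure."""
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float],
          chunk_vecs: Sequence[Sequence[float]],
          k: int) -> List[Tuple[int, float]]:
    """Return (chunk_index, score) for the k chunks most similar to the query."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


clause = ("Either party may terminate this Agreement for convenience "
          "upon thirty days written notice to the other party.")
# Whitespace split is an assumption standing in for a real tokenizer.
chunks = fixed_size_chunks(clause.split(), size=8)
for chunk in chunks:
    print(repr(chunk))
```

A query about the notice period for termination for convenience now has to match one chunk that mentions termination without the notice period, or another that mentions the notice period without termination.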
Chunking Strategy: The Decision That Matters Most
Chunking is the highest-leverage decision in RAG architecture and the one teams spend the least time on.
The problem with fixed-size chunking isn't just the boundary issue — it's that it treats all documents identically. A 512-token chunk of a legal contract is not the same unit as a 512-token chunk of a support knowledge base article. The semantic density, reference structure, and query patterns are completely different.
Our current approach for most production systems:
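    from typing import Any, Dict, List, Optional

    # Newer LangChain releases also expose this class from the
    # langchain_text_splitters package.
    from langchain.text_splitter import RecursiveCharacterTextSplitter


    def semantic_chunk(
        text: str,
        chunk_size: int = 800,
        chunk_overlap: int = 150,
        separators: Optional[List[str]] = None,
    ) -> List[Dict[str, Any]]:
        """
        Chunk with semantic separators and overlap for context preservation.
        Separators ordered from most to least structurally meaningful.
        Sizes are measured in characters because length_function is len.
        """
        if separators is None:
            separators = [
                "\n## ",   # H2 headings
                "\n### ",  # H3 headings
                "\n\n",    # Paragraph breaks
                "\n",      # Line breaks
                ". ",      # Sentence boundaries
                " ",       # Word boundaries (fallback)
            ]
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            separators=separators,
            length_function=len,
        )
        chunks = splitter.split_text(text)

        # Attach positional metadata for source citation. Search forward from the
        # previous match so repeated text is not pinned to its first occurrence.
        results = []
        search_from = 0
        for i, chunk in enumerate(chunks):
            start = text.find(chunk, search_from)
            if start == -1:
                start = text.find(chunk)  # fall back to a global search
            results.append({"content": chunk, "chunk_index": i, "char_start": start})
            if start != -1:
                search_from = max(start + 1, start + len(chunk) - chunk_overlap)
        return results

Two things matter here that most implementations skip: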
Overlap is not optional. A 150-character overlap (the splitter above measures length with len, so its sizes are characters, not tokens) means the end of one chunk appears at the start of the next. A query that falls on a boundary now has two chances to retrieve the relevant content. Without overlap, boundary queries fail silently.
Structural separators over fixed counts. For documents with natural structure (markdown, legal docs with section numbering, financial reports), prefer splitting at structural boundaries. A section that starts at token 600 and ends at token 1100 should be one chunk, not two halves.
For very long sections that can't be cleanly split, we use a parent-child chunking pattern: embed small child chunks for retrieval precision, but return the parent section as context. This preserves recall without flooding the prompt with irrelevant content.
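The parent-child pattern can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the production implementation: children are cut at a fixed character count for brevity, and `ChildChunk`, `build_parent_child_index`, and `expand_to_parents` are hypothetical names. The vector search over child chunks is assumed to happen elsewhere and to return `ChildChunk` hits in ranked order.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ChildChunk:
    child_id: str
    parent_id: str
    content: str


def build_parent_child_index(sections: Dict[str, str],
                             child_size: int = 200) -> List[ChildChunk]:
    """Split each parent section into small child chunks that remember their parent."""
    children: List[ChildChunk] = []
    for parent_id, section in sections.items():
        for n, start in enumerate(range(0, len(section), child_size)):
            children.append(
                ChildChunk(
                    child_id=f"{parent_id}#{n}",
                    parent_id=parent_id,
                    content=section[start:start + child_size],
                )
            )
    return children


def expand_to_parents(hits: List[ChildChunk],
                      sections: Dict[str, str]) -> List[str]:
    """Map retrieved children back to parent sections, de-duplicated, in hit order."""
    seen = set()
    parents: List[str] = []
    for hit in hits:
        if hit.parent_id not in seen:
            seen.add(hit.parent_id)
            parents.append(sections[hit.parent_id])
    return parents
```

Only the child `content` would be embedded. The de-duplication step matters because several children of the same section often match one query, and the parent should appear in the prompt only once.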
Embedding Model Selection
The default answer is text-embedding-3-large from OpenAI. It's performant, hosted, and good enough for most English-language business documents.
The honest answer is: it depends on your domain and query distribution.
OpenAI text-embedding-3-large — best default for general English documents. Strong on conversational queries, good at semantic similarity. Latency is fine. Cost adds up at scale (millions of chunks).
text-embedding-3-small — roughly 6x cheaper at OpenAI's launch list prices ($0.02 vs. $0.13 per million tokens), with a marginal quality drop for most use cases. We use this for large corpora where cost matters and the document domain is not highly technical.
Open-source (e.g., bge-large-en-v1.5, e5-mistral) — competitive quality, zero per-call cost once hosted. Worth it if you're running high query volume or have data residency requirements. The ops burden of self-hosting is real, so only go this route if you have the infrastructure.
Fine-tuned embeddings — worth it when your domain has specialized vocabulary that general-purpose embeddings handle poorly. We've seen meaningful recall improvements on legal and medical corpora with domain-adapted models. Cost is the fine-tuning run plus ongoing hosting; return is only there if general embeddings are measurably failing you.
Our approach: start with OpenAI, measure recall with your eval set (more on that below), and only switch models when you have data showing the switch is worth it.
Re-ranking: Worth It or Overkill?
A cross-encoder re-ranker (Cohere's Rerank API, or bge-reranker-large) takes the top-k retrieved chunks and scores them against the original query using a model that sees both simultaneously — not just their embeddings. It reorders results by actual relevance.
It's worth it when:
- Your top-1 retrieval precision is below ~70% on your eval set
- Your queries are long or complex (multi-part questions, conditions)
- The corpus has high semantic overlap (many chunks that "look" similar to a query but aren't actually relevant)
It's overkill when:
- Your corpus is small (fewer than 50k chunks) and retrieval is already precise
- Latency is critical and you can't afford the extra 200–400ms
- Queries are short and unambiguous
For the legal document system we built, re-ranking was essential. Contract language has high lexical similarity between clauses. A query about "termination for convenience" would surface "termination for cause" clauses in the top-k without re-ranking. With re-ranking, precision improved significantly across our eval set.
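The re-ranking step itself is small: score each (query, chunk) pair jointly, then re-sort. The sketch below keeps the scorer pluggable so it runs without a model. `rerank` and `toy_overlap_score` are hypothetical names, and the toy scorer is only a word-overlap placeholder, not a cross-encoder. In a real system you would pass a function that wraps your re-ranker of choice.

```python
from typing import Callable, List, Sequence, Tuple

# A pair scorer sees the query and the passage together and returns a relevance score.
ScoreFn = Callable[[str, str], float]


def rerank(query: str,
           candidates: Sequence[str],
           score_pair: ScoreFn,
           top_n: int = 3) -> List[Tuple[str, float]]:
    """Re-score first-stage candidates against the query and keep the best top_n."""
    scored = [(passage, score_pair(query, passage)) for passage in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def toy_overlap_score(query: str, passage: str) -> float:
    """Placeholder scorer: fraction of query words that appear in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words) / len(query_words) if query_words else 0.0


candidates = [
    "termination for cause requires material breach",
    "termination for convenience requires thirty days notice",
]
best = rerank("termination for convenience", candidates, toy_overlap_score, top_n=1)
print(best[0][0])
```

Because the re-ranker only sees the first-stage candidates, it can fix ordering but not recall: a chunk that never makes the top-k cannot be rescued.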
Hybrid Search: When Semantic Isn't Enough
Pure semantic search fails on exact terms. A user searching for "Section 12.4(b)" or "SKU-88291" doesn't need semantic similarity — they need exact match. BM25 handles this. Semantic search doesn't.
Hybrid search combines dense vector retrieval with sparse BM25 keyword retrieval and merges the two result lists, typically with Reciprocal Rank Fusion (RRF). Qdrant and Weaviate support hybrid queries natively. With Pinecone, you either supply your own sparse vectors alongside the dense ones or run BM25 separately and merge the result lists yourself.
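If you do end up merging manually, RRF is only a few lines. The sketch below assumes each retriever returns chunk IDs in ranked order. The function name is illustrative, and k = 60 is the commonly used smoothing constant rather than a value tuned for any particular corpus.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per ID, ranks start at 1."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


dense_hits = ["c1", "c2", "c3"]   # hypothetical vector-search results
sparse_hits = ["c3", "c1", "c4"]  # hypothetical BM25 results
print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
# → ['c1', 'c3', 'c2', 'c4']
```

RRF uses only ranks, so it sidesteps the problem that cosine similarities and BM25 scores live on different scales.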
We default to hybrid search on any corpus where users might query by:
- Specific identifiers (contract numbers, SKUs, case numbers)
- Proper nouns that embeddings may not capture well
- Exact quoted phrases from documents
The overhead is modest and the coverage improvement on edge cases is consistent.
Hallucination Controls
Two mechanisms that matter in production:
Source attribution with citation enforcement. Every answer must cite the chunk(s) it draws from. We enforce this via structured output: the LLM returns a JSON object with answer and sources (list of chunk IDs). If sources are empty, the answer is rejected. This creates a forcing function — the model can't answer without grounding.
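A minimal sketch of that rejection check, assuming the model's raw output is a JSON string with `answer` and `sources` keys. `validate_grounded_answer` is a hypothetical helper. The check that every cited ID was actually retrieved is an extra assumption on top of the empty-sources rule described above.

```python
import json
from typing import List, Optional, Set, TypedDict


class GroundedAnswer(TypedDict):
    answer: str
    sources: List[str]


def validate_grounded_answer(raw_json: str,
                             retrieved_ids: Set[str]) -> Optional[GroundedAnswer]:
    """Return the parsed answer if it is well-formed and cites retrieved chunks, else None."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None

    answer = data.get("answer")
    sources = data.get("sources")
    if not isinstance(answer, str) or not answer.strip():
        return None
    if not isinstance(sources, list) or not sources:
        return None  # empty or missing sources: reject the answer
    # Assumed extra check: every cited ID must be one of the retrieved chunks.
    if not all(isinstance(s, str) and s in retrieved_ids for s in sources):
        return None
    return {"answer": answer, "sources": sources}
```

A `None` result would typically trigger a retry or the fallback response. Note that this only verifies that citations exist and point at retrieved chunks, not that the cited text supports the answer.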
Confidence thresholds on retrieval. Before the LLM sees anything, we check the retrieval quality. If the top retrieved chunk has a similarity score below a threshold (we tune this per-deployment, typically 0.72–0.78), we return an "I don't have enough information to answer this reliably" response rather than passing weak context to the LLM. This is uncomfortable for clients at first ("why won't it answer?"), but it's better than confident wrong answers.
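The gate can be sketched as a thin wrapper around the generation call. `answer_or_fallback` and the `generate` callable are hypothetical, and the 0.75 default is just a placeholder inside the range quoted above. The right value depends on your embedding model and corpus.

```python
from typing import Callable, List, Tuple

FALLBACK = "I don't have enough information to answer this reliably."


def answer_or_fallback(hits: List[Tuple[str, float]],
                       generate: Callable[[List[str]], str],
                       min_score: float = 0.75) -> str:
    """hits are (chunk_text, similarity) pairs sorted best-first.

    If there are no hits, or the best one scores below min_score, return the
    fallback without calling the LLM at all.
    """
    if not hits or hits[0][1] < min_score:
        return FALLBACK
    return generate([text for text, _ in hits])
```

Logging every fallback alongside the query and top score gives you the data needed to tune the threshold later.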
Evaluation: How to Know If Your RAG Actually Works
You cannot improve what you don't measure. RAG systems that aren't evaluated are systems that silently degrade as document corpora change and query distributions shift.
The minimal eval setup we run on every production RAG:
- Retrieval precision@k: for a labeled test set of (query, expected_chunk_id) pairs, what fraction of queries return the relevant chunk in the top-k? (Strictly speaking this is a hit rate, or recall@k with one labeled chunk per query.) We target >80% at top-3.
- Answer faithfulness: does the LLM's answer contradict the retrieved context or assert anything the context does not contain? We use a lightweight LLM judge for this, evaluated on a random sample.
- Answer relevance: does the answer actually address the query? Also LLM-judged.
- No-answer rate: what fraction of queries hit the low-confidence threshold and return a fallback? This should be low for in-domain queries, and it should be monitored over time.
We run retrieval precision evals on every deployment and after any corpus update. The others run weekly on a sampled query log from production.
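The two retrieval-side metrics need no LLM and are cheap enough to run on every deployment. A minimal sketch, assuming retrieval results and labels are keyed by query string and that `hit_rate_at_k` and `no_answer_rate` are illustrative names:

```python
from typing import Dict, Sequence


def hit_rate_at_k(retrieved: Dict[str, Sequence[str]],
                  expected: Dict[str, str],
                  k: int = 3) -> float:
    """Fraction of labeled queries whose expected chunk ID appears in the top-k results."""
    if not expected:
        return 0.0
    hits = sum(
        1
        for query, chunk_id in expected.items()
        if chunk_id in list(retrieved.get(query, []))[:k]
    )
    return hits / len(expected)


def no_answer_rate(top_scores: Sequence[float], min_score: float) -> float:
    """Fraction of queries whose best similarity score falls below the fallback threshold."""
    if not top_scores:
        return 0.0
    return sum(1 for score in top_scores if score < min_score) / len(top_scores)


expected = {"q1": "a", "q2": "b", "q3": "c", "q4": "d"}
retrieved = {
    "q1": ["a", "x", "y"],
    "q2": ["x", "y", "b"],
    "q3": ["x", "y", "z", "c"],  # relevant chunk is ranked 4th, outside top-3
    "q4": [],
}
print(hit_rate_at_k(retrieved, expected, k=3))          # → 0.5
print(no_answer_rate([0.9, 0.7, 0.8, 0.6], 0.75))       # → 0.5
```

Wiring `hit_rate_at_k` into CI with a hard floor (for example, the 80% target above) turns a silent retrieval regression into a failed build.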
RAG Is a Retrieval Engineering Problem
The failure mode most teams fall into is treating RAG as an LLM problem — swapping models, tuning prompts, adding chain-of-thought — when the actual failure is in the retrieval layer. The LLM is only as good as the context it receives. If the wrong chunks come back, no amount of prompt engineering recovers the answer.
Invest the engineering effort where the leverage is:
- Chunking strategy matched to your document structure
- Hybrid search if your queries include exact terms
- Re-ranking if retrieval precision is the bottleneck
- Evaluation infrastructure from day one
The LLM at the end is almost the easiest part. Get the retrieval right first.
If you're building a RAG system and running into production issues, we're happy to dig in with you.