
Engineering · 06 May 2026 · 8 min read

Retrieval Quality: Evaluating RAG with Citations You Can Defend

A RAG system that cites the wrong passage is worse than one that admits it does not know. Here is how to evaluate retrieval quality and build citations that survive scrutiny.

Tumiso Graig Ramaboya

Founder, CEO & POPIA Information Officer


The failure mode that destroys trust in a RAG system is not the answer it refuses to give — it is the confident answer attached to the wrong citation. A user who clicks a citation and finds it does not support the claim stops trusting every answer afterward. Retrieval quality is therefore not an optimisation; it is the product. This piece is about how to measure it and how to make citations defensible.

Retrieval is the product, not the prompt

In a RAG system, answer quality is bounded by retrieval quality: the model can only be as right as the passages it was given. If the right passage is not retrieved, no prompt engineering recovers it. So the highest-leverage work is the retrieval layer — getting the correct, relevant passages into the context for every query — not refining the instruction that wraps them.

Why hybrid search beats pure vector search

Dense vector search alone misses things that matter in enterprise documents: exact policy numbers, case IDs, product codes, and proper nouns, where a near-synonym is wrong. Lexical search (BM25) catches those exact matches but misses paraphrases. Hybrid search runs both and fuses the results, so each method covers the other's blind spot, and on enterprise corpora it typically beats either alone. A reranker on top, scoring the fused candidates for relevance to the specific query, lifts quality further, especially on long or multi-part questions.
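One common way to fuse the two result lists is reciprocal rank fusion (RRF). The piece does not say which fusion method to use, so treat this as a minimal sketch under that assumption: the passage IDs are made up, and `k=60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of passage IDs into a single ranking.

    Each passage earns 1 / (k + rank) from every list it appears in,
    so passages ranked well by both retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, passage_id in enumerate(ranking, start=1):
            scores[passage_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


# Hypothetical results for one query (IDs are illustrative only).
dense_hits = ["p7", "p2", "p9"]    # from vector search
lexical_hits = ["p4", "p7", "p2"]  # from BM25

fused = reciprocal_rank_fusion([dense_hits, lexical_hits])
print(fused)
```

Because RRF works on ranks rather than raw scores, it sidesteps the problem that BM25 scores and cosine similarities are not on a comparable scale. A reranker would then rescore the top of `fused` against the query.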

The metrics that actually matter

Evaluate RAG on three axes, not one. Context relevance: were the retrieved passages actually relevant to the question? Faithfulness: is the answer supported by the retrieved passages, or did the model add unsupported claims? Answer relevance: did the answer address what was asked? A system can score well on one and fail another — high faithfulness but low context relevance means it faithfully summarised the wrong documents. Tools like Ragas operationalise these as automated scores.
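To make the "score well on one and fail another" point concrete, here is a small sketch that reports which axes fall short for a single answer. It assumes each axis has already been scored in the range 0 to 1 by whatever evaluator you use; the `RagScores` type, the `failing_axes` helper, and the 0.7 threshold are all hypothetical and not part of any library.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RagScores:
    """Per-answer scores in [0, 1], produced by an evaluator of your choice."""
    context_relevance: float
    faithfulness: float
    answer_relevance: float


def failing_axes(scores, threshold=0.7):
    """Return the names of the axes scoring below the (illustrative) threshold."""
    return [name for name, value in asdict(scores).items() if value < threshold]


# The failure pattern described above: faithful to the wrong documents.
example = RagScores(context_relevance=0.35, faithfulness=0.95, answer_relevance=0.80)
print(failing_axes(example))
```

Reporting the axes separately, rather than averaging them into one number, keeps a retrieval failure from being masked by a high faithfulness score.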

What makes a citation defensible

A defensible citation points to the specific passage that supports the claim, not the whole document, and the supported claim is genuinely contained in that passage. Two engineering practices get you there: chunk documents so a citation resolves to a readable, self-contained span (not a 40-page PDF), and enforce faithfulness so the model is constrained to claims its retrieved context supports. The acid test is the click-through: a reviewer clicks the citation and immediately sees the sentence that backs the claim.
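The two practices can be sketched as follows: chunk by paragraph while keeping character offsets, so a citation resolves to an exact span, then check that a quoted sentence really appears in the cited chunk. This is an illustration under simple assumptions (paragraphs separated by blank lines, verbatim quotes), not a full faithfulness check; the `Chunk` type, function names, and the `policy-7` document are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    start: int  # character offset of the span in the source document
    end: int
    text: str


def chunk_paragraphs(doc_id, text):
    """Split on blank lines, recording where each paragraph sits in the source."""
    chunks = []
    pos = 0  # offset at which the current raw paragraph begins
    for para in text.split("\n\n"):
        stripped = para.strip()
        if stripped:
            start = text.index(stripped, pos)
            chunks.append(Chunk(doc_id, start, start + len(stripped), stripped))
        pos += len(para) + 2  # skip the paragraph and its "\n\n" separator
    return chunks


def _normalise(s):
    """Collapse whitespace and lowercase so layout differences do not matter."""
    return " ".join(s.split()).lower()


def quote_in_chunk(chunk, quote):
    """Weak click-through proxy: does the cited chunk contain the quoted text?"""
    return _normalise(quote) in _normalise(chunk.text)


policy = "Refunds are issued within 14 days of a return.\n\nGift cards are non-refundable."
chunks = chunk_paragraphs("policy-7", policy)

print(quote_in_chunk(chunks[0], "within 14 days"))  # the right passage
print(quote_in_chunk(chunks[1], "within 14 days"))  # the wrong passage
```

A verbatim check like this only catches citations that point at the wrong span. Paraphrased claims still need a faithfulness score from an evaluator or a human reviewer.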

Treat retrieval like code: regression-test it

Retrieval quality drifts as the corpus grows, the chunking changes, or the embedding model is swapped. Without a regression suite — a fixed set of questions with known-good passages, scored on every change — you discover degradation in production. Build the evaluation harness before the first production deployment, not after the first complaint. This is the same discipline that separates demo agents from production ones, covered in the agents piece.
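A regression suite of this kind can be very small. The sketch below scores recall@k over a fixed question set and compares the mean with a baseline. The questions, passage IDs, stub retriever, and the 0.9 baseline are placeholders; in practice `retrieve` would call your real retrieval layer.

```python
def recall_at_k(retrieved, expected, k):
    """Fraction of the known-good passages that appear in the top k results."""
    top_k = set(retrieved[:k])
    return len(top_k & set(expected)) / len(expected)


def run_suite(cases, retrieve, k=5, baseline=0.9):
    """Score every case and report (mean recall@k, whether it meets the baseline)."""
    scores = [recall_at_k(retrieve(c["question"]), c["expected"], k) for c in cases]
    mean = sum(scores) / len(scores)
    return mean, mean >= baseline


# Fixed questions with known-good passage IDs (all values are illustrative).
cases = [
    {"question": "What is the refund window?", "expected": ["p1"]},
    {"question": "Which items are excluded?", "expected": ["p2", "p5"]},
]

# Stub standing in for the real retriever, so the sketch runs on its own.
canned_results = {
    "What is the refund window?": ["p1", "p3"],
    "Which items are excluded?": ["p2", "p9"],
}


def stub_retrieve(question):
    return canned_results[question]


mean_recall, passed = run_suite(cases, stub_retrieve)
print(mean_recall, passed)
```

Run on every change to the corpus, chunking, or embedding model, a failing result here surfaces the degradation before users do.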

Why this is also a compliance property

Defensible citations are not only a trust feature — they are a compliance asset. In a regulated setting, "the system said so" is not an acceptable basis for a decision; "here is the source passage the answer is grounded in" is. Citation quality is what lets a RAG answer survive an audit, which connects directly to the broader POPIA posture in the compliance checklist.

How we build and measure this

Our RAG delivery ships with hybrid retrieval, source citations on every answer, and a Ragas evaluation harness with regression tests from day one — the full scope is in RAG Knowledge Base Setup, productised as RAG Studio. For the decision of when RAG is even the right tool, see RAG vs fine-tuning.