A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.

Real-World RAG Incidents: Lessons from a Production Rollout

When we first rolled out a RAG-based assistant for our internal SRE team, nothing in the vendor docs prepared us for the messy parts.

Incident: Cached Wrong Answers

The first painful incident happened on a Monday morning. A runbook query returned an outdated PostgreSQL failover procedure because:

We cached answers aggressively to save tokens.
The underlying runbook in Git had been updated over the weekend.
Our invalidation logic only watched the vector store, not the source repo.

How We Fixed It

We changed our cache key to include the document commit hash.
We added a background job that compares each document's latest Git commit against the commit its stored vectors were built from.
We updated the runbook template to show the last-updated date in the answer.
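
The first two fixes can be sketched roughly as follows. This is a minimal illustration, not our production code: the function names `make_cache_key` and `find_stale_documents` are hypothetical, and it assumes `doc_commits` holds the current commit of each document as read from Git, not the commit recorded at indexing time.

```python
import hashlib


def make_cache_key(query: str, doc_commits: dict[str, str]) -> str:
    """Build a cache key that changes whenever a source document changes.

    doc_commits maps a document path to its current Git commit hash
    (assumed to be read from the source repo, not from the vector store).
    """
    # Normalize case and whitespace so trivially different queries share a key.
    normalized = " ".join(query.lower().split())
    # Sort so the key does not depend on the order documents were retrieved in.
    parts = [normalized] + [
        f"{path}@{commit}" for path, commit in sorted(doc_commits.items())
    ]
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()


def find_stale_documents(
    repo_commits: dict[str, str], indexed_commits: dict[str, str]
) -> list[str]:
    """Return paths whose indexed commit differs from the repo, or that are not indexed."""
    return sorted(
        path
        for path, commit in repo_commits.items()
        if indexed_commits.get(path) != commit
    )
```

With a key built this way, an updated runbook produces a cache miss instead of a stale answer, and the paths returned by `find_stale_documents` are the ones a background job would need to re-embed.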

Incident: Embeddings Going Silent

Two weeks later, we saw a spike in “no relevant context found” errors during incident calls. The vector DB was healthy; the problem turned out to be:

A new data source contained HTML-heavy content.
We were chunking purely by character count.
As a result, the relevant text was split across three different chunks.
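
To make the failure mode concrete, here is a small illustration of fixed-size character chunking. The sample sentence and the chunk size are made up for the example, and `chunk_by_chars` is a hypothetical helper, not the chunker we ran.

```python
def chunk_by_chars(text: str, size: int) -> list[str]:
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


procedure = (
    "Promote the replica, then repoint the service, then verify replication lag."
)
chunks = chunk_by_chars(procedure, 30)

# The procedure ends up spread over several chunks, and the boundary
# falls in the middle of a word, so no single chunk holds the full step list.
for index, chunk in enumerate(chunks):
    print(index, repr(chunk))
```

No single chunk contains the whole procedure, so a query about it can only match fragments, each of which is a weaker match than the intact passage would be.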

Changes We Made

Switched to semantic and heading-based chunking with overlap.
Added metrics for “chunks per query” and “distance of top-1 match”.
Logged a sample of low-quality retrievals for manual review.
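
A rough sketch of the heading-based part of the new chunker, with overlap. It assumes the HTML has already been converted to plain lines in which headings start with `#`. That input format, the function name `chunk_by_headings`, and the default sizes are assumptions for illustration. The semantic splitting step is left out.

```python
def chunk_by_headings(
    lines: list[str], max_lines: int = 20, overlap: int = 2
) -> list[dict]:
    """Chunk text so that no chunk crosses a heading.

    Sections longer than `max_lines` are split into windows that share
    `overlap` lines with the previous window.
    """
    if not 0 <= overlap < max_lines:
        raise ValueError("overlap must be at least 0 and less than max_lines")

    # Pass 1: group non-empty body lines under the heading that precedes them.
    sections = []
    heading, body = "", []
    for line in lines:
        if line.startswith("#"):
            if body:
                sections.append((heading, body))
            heading, body = line.lstrip("#").strip(), []
        elif line.strip():
            body.append(line.strip())
    if body:
        sections.append((heading, body))

    # Pass 2: slide an overlapping window over each section.
    chunks = []
    step = max_lines - overlap
    for heading, body in sections:
        start = 0
        while True:
            window = body[start:start + max_lines]
            chunks.append({"heading": heading, "text": "\n".join(window)})
            if start + max_lines >= len(body):
                break
            start += step
    return chunks
```

Keeping the heading on every chunk means a retrieved fragment still says which procedure it belongs to, and the overlap gives a step that straddles a window boundary a chance to appear whole in at least one chunk.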

Checklist for RAG in Production

Track cache hit rate, LLM error rate, and no-context rate.
Store the retrieved chunk IDs alongside each answer.
Regularly sample answers and review them with the owning team.
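
One possible shape for the first two checklist items, sketched under assumptions: the `AnswerRecord` fields and the `summarize` helper are hypothetical names, and a real deployment would more likely emit these values to its metrics system than compute them in memory.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AnswerRecord:
    """One served answer, stored with the evidence it was built from."""

    query: str
    answer: str
    chunk_ids: list[str] = field(default_factory=list)  # empty means no context was found
    top1_distance: Optional[float] = None  # distance of the best match, if any
    cache_hit: bool = False
    llm_error: bool = False


def summarize(records: list[AnswerRecord]) -> dict[str, float]:
    """Compute cache hit rate, LLM error rate, and no-context rate."""
    total = len(records)
    if total == 0:
        return {"cache_hit_rate": 0.0, "llm_error_rate": 0.0, "no_context_rate": 0.0}
    return {
        "cache_hit_rate": sum(r.cache_hit for r in records) / total,
        "llm_error_rate": sum(r.llm_error for r in records) / total,
        "no_context_rate": sum(not r.chunk_ids for r in records) / total,
    }
```

Storing `chunk_ids` on every record is what makes the sampling review practical: the owning team can see exactly which chunks an answer was built from.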

The marketing pages sold RAG as magic. In reality it behaves more like a database: if you don’t design for drift, invalidation, and observability, it will betray you at the worst moment.

About Kiril urbonas
