We invalidate ~6% of LLM outputs before they reach a downstream system. Here's how we structure prompts and validators to catch malformed responses early.
Of every 1,000 LLM completions our pipelines emit, approximately 63 are syntactically or semantically invalid — broken JSON, missing fields, fields filled with nonsense values that look right but aren't. We catch all of them before they reach a downstream system. Here's the pattern, the schema-first prompt structures, and the retry logic that gets us to four nines of correct outputs.
We use gpt-4o-mini for high-volume routes and gpt-4o for complex extractions. Three prompt patterns get us reliably parseable output. The first: the prompt explicitly shows the schema the model should produce.
import json

SCHEMA = {
    "type": "object",
    "required": ["category", "confidence", "reasoning"],
    "additionalProperties": False,  # also required later by OpenAI strict structured outputs
    "properties": {
        "category": {"enum": ["billing", "technical", "account", "other"]},
        "confidence": {"type": "number", "minimum": 0.0, "maximum": 1.0},
        "reasoning": {"type": "string", "minLength": 10, "maxLength": 200},
    },
}

PROMPT = f"""Classify the customer message below.
Output JSON matching this schema EXACTLY. No prose before or after the JSON.

Schema:
{json.dumps(SCHEMA, indent=2)}

Message:
{user_message}
"""
This is the highest-quality pattern. Not because the model can't be unfaithful, but because review and validation become trivial when the spec is visible alongside the output.
For non-trivial extractions, examples beat descriptions:
PROMPT = """Extract product mentions from text. Output JSON.
Example:
Input: "I love my MacBook Pro M4 and the iPhone 16 case I bought."
Output: {"products": [
{"name": "MacBook Pro", "version": "M4", "type": "laptop"},
{"name": "iPhone 16 case", "version": null, "type": "accessory"}
]}
Example:
Input: "Visited the store, didn't buy anything."
Output: {"products": []}
Now extract from:
""" + user_text
Two examples cover most edge cases. We rarely use more than three.
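Few-shot prompts like the one above are easy to assemble mechanically from (input, expected output) pairs, which keeps the examples in one place and guarantees consistent formatting. A minimal sketch — the `build_few_shot_prompt` helper and its sample query are illustrative, not part of our pipeline:

```python
import json


def build_few_shot_prompt(task: str, examples: list[tuple[str, dict]], query: str) -> str:
    """Assemble a few-shot extraction prompt from (input, expected-output) pairs."""
    parts = [task]
    for text, output in examples:
        # Serialize the expected output with json.dumps so the examples are
        # always syntactically valid JSON, exactly as the model should reply.
        parts.append(f'Example:\nInput: "{text}"\nOutput: {json.dumps(output)}')
    parts.append(f"Now extract from:\n{query}")
    return "\n\n".join(parts)


prompt = build_few_shot_prompt(
    "Extract product mentions from text. Output JSON.",
    [
        ("I love my MacBook Pro M4.",
         {"products": [{"name": "MacBook Pro", "version": "M4", "type": "laptop"}]}),
        ("Visited the store, didn't buy anything.",
         {"products": []}),
    ],
    "Just picked up an iPhone 16.",
)
```

Keeping the pairs as data also makes it trivial to add a regression test: every example input should round-trip through the live model to its expected output.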
OpenAI's response_format={"type": "json_schema", ...} constrains the decoder to produce valid JSON matching a schema. We use this for the highest-volume routes.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "classification", "strict": True, "schema": SCHEMA},
    },
)
This eliminates syntactic failures (broken JSON) entirely. Semantic failures still happen — the model writes valid JSON with wrong content — so validation downstream is still required.
Every LLM output passes through a strict Pydantic model before any downstream code touches it.
from enum import Enum

from pydantic import BaseModel, Field, field_validator


class Category(str, Enum):
    BILLING = "billing"
    TECHNICAL = "technical"
    ACCOUNT = "account"
    OTHER = "other"


class Classification(BaseModel):
    category: Category
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str = Field(min_length=10, max_length=200)

    @field_validator("reasoning")
    @classmethod
    def reasoning_must_be_meaningful(cls, v: str) -> str:
        # Catches the failure mode where the model narrates instead of
        # concluding, e.g. "I will classify this message..."
        bad_patterns = ["I will", "Let me", "Sure,"]
        for p in bad_patterns:
            if v.startswith(p):
                raise ValueError(f"reasoning starts with non-substantive phrase: {p!r}")
        return v
The custom validator on reasoning is the kind of thing that comes from production experience. The model occasionally produces "Let me think about this..." in fields that should contain conclusions. The validator catches it.
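To see that validator in action, here is a standalone check (re-declaring the model inline so the snippet runs on its own; the sample payloads are made up):

```python
from enum import Enum

from pydantic import BaseModel, Field, ValidationError, field_validator


class Category(str, Enum):
    BILLING = "billing"
    TECHNICAL = "technical"
    ACCOUNT = "account"
    OTHER = "other"


class Classification(BaseModel):
    category: Category
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str = Field(min_length=10, max_length=200)

    @field_validator("reasoning")
    @classmethod
    def reasoning_must_be_meaningful(cls, v: str) -> str:
        if any(v.startswith(p) for p in ("I will", "Let me", "Sure,")):
            raise ValueError("reasoning starts with non-substantive phrase")
        return v


# Valid JSON with a substantive conclusion: parses cleanly.
good = Classification.model_validate_json(
    '{"category": "billing", "confidence": 0.92,'
    ' "reasoning": "Mentions an invoice and a charge dispute."}'
)

# Valid JSON, but the reasoning field narrates: the validator rejects it.
try:
    Classification.model_validate_json(
        '{"category": "billing", "confidence": 0.92,'
        ' "reasoning": "Let me think about this message..."}'
    )
    raised = False
except ValidationError:
    raised = True
```

Both payloads are syntactically perfect JSON; only the semantic check separates them.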
When validation fails, we don't just retry. We send the error back to the model:
from pydantic import BaseModel, ValidationError


class LLMValidationError(Exception):
    pass


def call_with_validation(prompt: str, schema: type[BaseModel], max_retries: int = 2) -> BaseModel:
    messages = [{"role": "user", "content": prompt}]
    for attempt in range(max_retries + 1):
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
            response_format={"type": "json_object"},
        )
        raw = response.choices[0].message.content
        try:
            return schema.model_validate_json(raw)
        except (ValueError, ValidationError) as e:
            if attempt >= max_retries:
                raise LLMValidationError(f"failed after {max_retries + 1} attempts: {e}")
            # Targeted repair: show the model its own output and the exact error
            messages.append({"role": "assistant", "content": raw})
            messages.append({
                "role": "user",
                "content": f"Your output failed validation: {e}. Fix and reply with corrected JSON only.",
            })
Showing the model its own previous output plus the specific error dramatically improves the recovery rate compared to a fresh retry.
When all retries fail, what happens?
Fail-closed (our choice for classification, routing decisions, anything that gates user experience): the pipeline returns an error, and the user sees a generic "we couldn't categorize this — please pick from the list" UI.
Fail-open (our choice for enrichment, optional metadata, anything nice-to-have): the pipeline returns null, and downstream code handles the absence.
The choice depends on what's worse: a confident-but-wrong categorization, or no categorization. For us, the former is much worse.
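Both policies fit behind one small wrapper so the decision is explicit at every call site rather than implicit in scattered try/except blocks. A sketch, assuming the `LLMValidationError` from earlier; `classify_or_none` and the `fail_open` flag are names we're inventing here for illustration:

```python
class LLMValidationError(Exception):
    pass


def classify_or_none(call, fail_open: bool):
    """Wrap a validated LLM call with an explicit failure policy.

    fail_open=True  -> return None when retries are exhausted (nice-to-have data)
    fail_open=False -> re-raise so the caller surfaces an error (gating data)
    """
    try:
        return call()
    except LLMValidationError:
        if fail_open:
            return None
        raise


def always_fails():
    # Stand-in for a call_with_validation() invocation that exhausted retries.
    raise LLMValidationError("exhausted retries")


enrichment = classify_or_none(always_fails, fail_open=True)  # None; downstream handles absence
```

Making the flag a required argument forces every new pipeline to pick a policy up front.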
Beyond schema validation, we have semantic validators that have caught real production issues:
class PricingDecision(BaseModel):
    discount_percent: float = Field(ge=0, le=50)  # never give more than 50% off
Caught a model that emitted discount_percent: 95 for an edge-case prompt. Without the validator, that would have been a billing incident.
from pydantic import BaseModel, model_validator


class ExtractedAddress(BaseModel):
    street: str
    city: str
    state: str
    country: str

    @model_validator(mode="after")
    def state_required_if_us(self):
        if self.country == "US" and not self.state:
            raise ValueError("US addresses must have state")
        return self
Catches partial extractions. The pipeline retries with a corrective prompt.
KNOWN_PRODUCTS = set(load_product_catalog())


class ProductMention(BaseModel):
    name: str
    sku: str

    @field_validator("sku")
    @classmethod
    def sku_must_exist(cls, v: str) -> str:
        if v not in KNOWN_PRODUCTS:
            raise ValueError(f"unknown SKU: {v}")
        return v
Caught the model occasionally inventing plausible-looking SKUs. The validator, combined with the repair retry, forces outputs back to the catalog.
Validation isn't free. Each failed-and-retried call costs roughly the same as a successful call. Our cost overhead from retries is ~6% above raw success-only call volume.
That's far cheaper than the alternative — corrupted downstream state — and 6% is a budget worth paying.
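The overhead math is simple: if a first attempt fails with probability p and each retry costs about the same as the original call, then under independent geometric retries the expected calls per item are 1/(1 − p). A quick check with our numbers:

```python
p = 0.063  # first-attempt invalid rate (63 per 1,000)

# Expected calls per item if every failure is retried and failures are
# independent: 1 + p + p^2 + ... = 1 / (1 - p)
expected_calls = 1 / (1 - p)
overhead = expected_calls - 1  # extra call volume vs. success-only

print(f"{overhead:.1%}")  # roughly 6.7%
```

That lands close to the ~6% we observe; the real figure is slightly lower because repair retries, which see the previous output and the error, succeed more often than independent fresh attempts would.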
When validation fails, it usually fails on the same kinds of inputs. We log every validation failure together with the input that produced it, and quarterly review of these logs has repeatedly produced significant prompt and validator improvements.
For our task, gpt-4o-mini produces classifications roughly as accurate as gpt-4o's — but mini emits invalid JSON about 3× more often. Strict structured outputs close most of that gap.
If you stream tokens to the user UI, you can't validate before the user sees text. We render tentatively, then revalidate at end-of-stream and post a correction if needed. Honest, but ugly. We avoid streaming for any pipeline where final-output correctness matters.
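The end-of-stream revalidation looks roughly like this — a sketch in which the stream and the `must_be_json` check are stubs standing in for the streaming API and our real validators:

```python
import json


def revalidate_stream(chunks, validate):
    """Render chunks tentatively, then validate the full text at end-of-stream.

    Returns (final_text, correction), where correction is None when the
    streamed text passed validation and needs no follow-up edit.
    """
    rendered = []
    for chunk in chunks:
        rendered.append(chunk)  # in production: flush each chunk to the UI immediately
    full = "".join(rendered)
    try:
        validate(full)
        return full, None
    except ValueError as e:
        # The user has already seen the text; all we can do is post a correction.
        return full, f"Correction: previous output was invalid ({e})"


def must_be_json(text):
    try:
        json.loads(text)
    except json.JSONDecodeError:
        raise ValueError("not valid JSON")


text, correction = revalidate_stream(['{"ok": ', 'true}'], must_be_json)
```

The awkwardness is visible in the return type: by the time `correction` exists, the tentative render is already on screen, which is exactly why we avoid streaming where final-output correctness matters.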
For any new LLM pipeline, and for any production system that consumes LLM output programmatically, schema-first validation is the difference between four-nines reliability and a steady stream of small downstream corruptions you'll spend years explaining.