We invalidate ~6% of LLM outputs before they reach a downstream system. Here's how we structure prompts and validators to catch malformed responses early.
Of every 1,000 LLM completions our pipelines emit, approximately 63 are syntactically or semantically invalid — broken JSON, missing fields, fields filled with nonsense values that look right but aren't. We catch all of them before they reach a downstream system. Here's the pattern, the schema-first prompt structures, and the retry logic that gets us to four nines of correct outputs.
We use gpt-4o-mini for high-volume routes and gpt-4o for complex extractions. Three prompt patterns get us reliably parseable output. The first: the prompt explicitly shows the schema the model should produce.
import json

SCHEMA = {
    "type": "object",
    "required": ["category", "confidence", "reasoning"],
    "additionalProperties": False,  # also required later by OpenAI strict structured outputs
    "properties": {
        "category": {"enum": ["billing", "technical", "account", "other"]},
        "confidence": {"type": "number", "minimum": 0.0, "maximum": 1.0},
        "reasoning": {"type": "string", "minLength": 10, "maxLength": 200},
    },
}

PROMPT = f"""Classify the customer message below.
Output JSON matching this schema EXACTLY. No prose before or after the JSON.

Schema:
{json.dumps(SCHEMA, indent=2)}

Message:
{user_message}
"""
This is the highest-quality pattern. Not because the model can't be unfaithful, but because review and validation become trivial when the spec is visible alongside the output.
For non-trivial extractions, examples beat descriptions:
PROMPT = """Extract product mentions from text. Output JSON.
Example:
Input: "I love my MacBook Pro M4 and the iPhone 16 case I bought."
Output: {"products": [
{"name": "MacBook Pro", "version": "M4", "type": "laptop"},
{"name": "iPhone 16 case", "version": null, "type": "accessory"}
]}
Example:
Input: "Visited the store, didn't buy anything."
Output: {"products": []}
Now extract from:
""" + user_text
Two examples cover most edge cases. We rarely use more than three.
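Few-shot prompts like the one above are easy to assemble mechanically from (input, expected output) pairs, which keeps the examples in one place and guarantees consistent formatting. A minimal sketch — the `build_few_shot_prompt` helper and its sample query are illustrative, not part of our pipeline:

```python
import json


def build_few_shot_prompt(task: str, examples: list[tuple[str, dict]], query: str) -> str:
    """Assemble a few-shot extraction prompt from (input, expected-output) pairs."""
    parts = [task]
    for text, output in examples:
        # Serialize the expected output with json.dumps so the examples are
        # always syntactically valid JSON, exactly as the model should reply.
        parts.append(f'Example:\nInput: "{text}"\nOutput: {json.dumps(output)}')
    parts.append(f"Now extract from:\n{query}")
    return "\n\n".join(parts)


prompt = build_few_shot_prompt(
    "Extract product mentions from text. Output JSON.",
    [
        ("I love my MacBook Pro M4.",
         {"products": [{"name": "MacBook Pro", "version": "M4", "type": "laptop"}]}),
        ("Visited the store, didn't buy anything.",
         {"products": []}),
    ],
    "Just picked up an iPhone 16.",
)
```

Keeping the pairs as data also makes it trivial to add a regression test: every example input should round-trip through the live model to its expected output.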
OpenAI's response_format={"type": "json_schema", ...} constrains the decoder to produce valid JSON matching a schema. We use this for the highest-volume routes.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "classification", "strict": True, "schema": SCHEMA},
    },
)
This eliminates syntactic failures (broken JSON) entirely. Semantic failures still happen — the model writes valid JSON with wrong content — so validation downstream is still required.
Every LLM output passes through a strict Pydantic model before any downstream code touches it.
from enum import Enum

from pydantic import BaseModel, Field, field_validator


class Category(str, Enum):
    BILLING = "billing"
    TECHNICAL = "technical"
    ACCOUNT = "account"
    OTHER = "other"


class Classification(BaseModel):
    category: Category
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str = Field(min_length=10, max_length=200)

    @field_validator("reasoning")
    @classmethod
    def reasoning_must_be_meaningful(cls, v: str) -> str:
        # Catches the failure mode where the model narrates instead of
        # concluding, e.g. "I will classify this message..."
        bad_patterns = ["I will", "Let me", "Sure,"]
        for p in bad_patterns:
            if v.startswith(p):
                raise ValueError(f"reasoning starts with non-substantive phrase: {p!r}")
        return v
The custom validator on reasoning is the kind of thing that comes from production experience. The model occasionally produces "Let me think about this..." in fields that should contain conclusions. The validator catches it.
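To see that validator in action, here is a standalone check (re-declaring the model inline so the snippet runs on its own; the sample payloads are made up):

```python
from enum import Enum

from pydantic import BaseModel, Field, ValidationError, field_validator


class Category(str, Enum):
    BILLING = "billing"
    TECHNICAL = "technical"
    ACCOUNT = "account"
    OTHER = "other"


class Classification(BaseModel):
    category: Category
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str = Field(min_length=10, max_length=200)

    @field_validator("reasoning")
    @classmethod
    def reasoning_must_be_meaningful(cls, v: str) -> str:
        if any(v.startswith(p) for p in ("I will", "Let me", "Sure,")):
            raise ValueError("reasoning starts with non-substantive phrase")
        return v


# Valid JSON with a substantive conclusion: parses cleanly.
good = Classification.model_validate_json(
    '{"category": "billing", "confidence": 0.92,'
    ' "reasoning": "Mentions an invoice and a charge dispute."}'
)

# Valid JSON, but the reasoning field narrates: the validator rejects it.
try:
    Classification.model_validate_json(
        '{"category": "billing", "confidence": 0.92,'
        ' "reasoning": "Let me think about this message..."}'
    )
    raised = False
except ValidationError:
    raised = True
```

Both payloads are syntactically perfect JSON; only the semantic check separates them.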
When validation fails, we don't just retry. We send the error back to the model:
from pydantic import BaseModel, ValidationError


class LLMValidationError(Exception):
    pass


def call_with_validation(prompt: str, schema: type[BaseModel], max_retries: int = 2) -> BaseModel:
    messages = [{"role": "user", "content": prompt}]
    for attempt in range(max_retries + 1):
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
            response_format={"type": "json_object"},
        )
        raw = response.choices[0].message.content
        try:
            return schema.model_validate_json(raw)
        except (ValueError, ValidationError) as e:
            if attempt >= max_retries:
                raise LLMValidationError(f"failed after {max_retries + 1} attempts: {e}")
            # Targeted repair: show the model its own output and the exact error
            messages.append({"role": "assistant", "content": raw})
            messages.append({
                "role": "user",
                "content": f"Your output failed validation: {e}. Fix and reply with corrected JSON only.",
            })
Showing the model its own previous output plus the specific error dramatically improves the recovery rate compared to a fresh retry.
When all retries fail, what happens?
Fail-closed (our choice for classification, routing decisions, anything that gates user experience): the pipeline returns an error, and the user sees a generic "we couldn't categorize this — please pick from the list" UI.
Fail-open (our choice for enrichment, optional metadata, anything nice-to-have): the pipeline returns null, and downstream code handles the absence.
The choice depends on what's worse: a confident-but-wrong categorization, or no categorization. For us, the former is much worse.
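Both policies fit behind one small wrapper so the decision is explicit at every call site rather than implicit in scattered try/except blocks. A sketch, assuming the `LLMValidationError` from earlier; `classify_or_none` and the `fail_open` flag are names we're inventing here for illustration:

```python
class LLMValidationError(Exception):
    pass


def classify_or_none(call, fail_open: bool):
    """Wrap a validated LLM call with an explicit failure policy.

    fail_open=True  -> return None when retries are exhausted (nice-to-have data)
    fail_open=False -> re-raise so the caller surfaces an error (gating data)
    """
    try:
        return call()
    except LLMValidationError:
        if fail_open:
            return None
        raise


def always_fails():
    # Stand-in for a call_with_validation() invocation that exhausted retries.
    raise LLMValidationError("exhausted retries")


enrichment = classify_or_none(always_fails, fail_open=True)  # None; downstream handles absence
```

Making the flag a required argument forces every new pipeline to pick a policy up front.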
Beyond schema validation, we have semantic validators that have caught real production issues:
class PricingDecision(BaseModel):
    discount_percent: float = Field(ge=0, le=50)  # never give more than 50% off
Caught a model that emitted discount_percent: 95 for an edge-case prompt. Without the validator, that would have been a billing incident.
from pydantic import BaseModel, model_validator


class ExtractedAddress(BaseModel):
    street: str
    city: str
    state: str
    country: str

    @model_validator(mode="after")
    def state_required_if_us(self):
        if self.country == "US" and not self.state:
            raise ValueError("US addresses must have state")
        return self
Catches partial extractions. The pipeline retries with a corrective prompt.
KNOWN_PRODUCTS = set(load_product_catalog())


class ProductMention(BaseModel):
    name: str
    sku: str

    @field_validator("sku")
    @classmethod
    def sku_must_exist(cls, v: str) -> str:
        if v not in KNOWN_PRODUCTS:
            raise ValueError(f"unknown SKU: {v}")
        return v
Caught the model occasionally inventing plausible-looking SKUs. The validator, combined with the repair retry, forces outputs back to the catalog.
Validation isn't free. Each failed-and-retried call costs roughly the same as a successful call. Our cost overhead from retries is ~6% above raw success-only call volume.
That's far cheaper than the alternative — corrupted downstream state — and 6% is a budget worth paying.
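The overhead math is simple: if a first attempt fails with probability p and each retry costs about the same as the original call, then under independent geometric retries the expected calls per item are 1/(1 − p). A quick check with our numbers:

```python
p = 0.063  # first-attempt invalid rate (63 per 1,000)

# Expected calls per item if every failure is retried and failures are
# independent: 1 + p + p^2 + ... = 1 / (1 - p)
expected_calls = 1 / (1 - p)
overhead = expected_calls - 1  # extra call volume vs. success-only

print(f"{overhead:.1%}")  # roughly 6.7%
```

That lands close to the ~6% we observe; the real figure is slightly lower because repair retries, which see the previous output and the error, succeed more often than independent fresh attempts would.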
When validation fails, it usually fails on the same kinds of inputs. We log every validation failure together with the input that produced it, and quarterly review of these logs has repeatedly produced significant prompt and validator improvements.
For our task, gpt-4o-mini produces classifications roughly as accurate as gpt-4o's — but mini emits invalid JSON about 3× more often. Strict structured outputs close most of that gap.
If you stream tokens to the user UI, you can't validate before the user sees text. We render tentatively, then revalidate at end-of-stream and post a correction if needed. Honest, but ugly. We avoid streaming for any pipeline where final-output correctness matters.
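The end-of-stream revalidation looks roughly like this — a sketch in which the stream and the `must_be_json` check are stubs standing in for the streaming API and our real validators:

```python
import json


def revalidate_stream(chunks, validate):
    """Render chunks tentatively, then validate the full text at end-of-stream.

    Returns (final_text, correction), where correction is None when the
    streamed text passed validation and needs no follow-up edit.
    """
    rendered = []
    for chunk in chunks:
        rendered.append(chunk)  # in production: flush each chunk to the UI immediately
    full = "".join(rendered)
    try:
        validate(full)
        return full, None
    except ValueError as e:
        # The user has already seen the text; all we can do is post a correction.
        return full, f"Correction: previous output was invalid ({e})"


def must_be_json(text):
    try:
        json.loads(text)
    except json.JSONDecodeError:
        raise ValueError("not valid JSON")


text, correction = revalidate_stream(['{"ok": ', 'true}'], must_be_json)
```

The awkwardness is visible in the return type: by the time `correction` exists, the tentative render is already on screen, which is exactly why we avoid streaming where final-output correctness matters.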
For any new LLM pipeline, and for any production system that consumes LLM output programmatically, schema-first validation is the difference between four-nines reliability and a steady stream of small downstream corruptions you'll spend years explaining.