Battle-tested prompt patterns from running LLM features in production: structured output, chain-of-thought, and graceful failure handling.
After running LLM-powered features for 8 months in production, these are the patterns that survived contact with real users and messy data.
Asking an LLM to "return JSON" works 90% of the time. The other 10% crashes your parser at 2 AM.
What we do:
```python
import json

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class ExtractedEntity(BaseModel):
    name: str
    category: str
    confidence: float

SYSTEM_PROMPT = """Extract entities from the text.
Return ONLY valid JSON matching this schema:
{"name": string, "category": string, "confidence": number 0-1}
Return an array. No explanation, no markdown fences."""

def extract_entities(text: str) -> list[ExtractedEntity]:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
        temperature=0.1,
    )
    raw = response.choices[0].message.content.strip()
    # Strip markdown fences if the model adds them anyway
    if raw.startswith("```"):
        raw = raw.split("\n", 1)[1].rsplit("```", 1)[0]
    data = json.loads(raw)
    # Pydantic validates each item against the schema
    return [ExtractedEntity(**item) for item in data]
```
Why it works: Low temperature, explicit schema in the prompt, and a defensive parser that handles the most common failure mode (markdown fences).
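The fence-stripping logic is worth pulling out and unit-testing on its own, since it is the part that saves you at 2 AM. A minimal sketch of that defensive step as a standalone helper (the name `parse_llm_json` is ours, not part of the snippet above):

```python
import json

def parse_llm_json(raw: str):
    """Defensively parse JSON from an LLM response.

    Handles the most common failure mode: the model wrapping its
    output in markdown code fences despite being told not to.
    """
    raw = raw.strip()
    if raw.startswith("```"):
        # Drop the opening fence line, then the closing fence
        raw = raw.split("\n", 1)[1].rsplit("```", 1)[0]
    return json.loads(raw)
```

Because it takes a plain string, you can test it against every malformed response you have seen in your logs, without making a single API call.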
For classification tasks with nuance, asking the model to think step-by-step improved accuracy from 78% to 91%.
```text
Classify this support ticket. Think step by step:

1. What product area does this relate to?
2. Is this a bug report, feature request, or question?
3. What is the urgency (low/medium/high)?

Then return your answer as JSON: {"area": ..., "type": ..., "urgency": ...}
```
Key insight: The reasoning steps aren't just for the model; they're also an audit trail when a human reviews the classification.
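One practical consequence: the response now contains prose before the JSON, so the parser has to split the two. A minimal sketch, assuming the prompt puts a single flat JSON object at the very end of the reply (`split_reasoning_and_answer` is a hypothetical helper):

```python
import json

def split_reasoning_and_answer(response: str) -> tuple[str, dict]:
    """Separate free-form reasoning from the trailing JSON answer.

    Assumes a flat JSON object sits at the end of the response,
    as the step-by-step prompt requests.
    """
    start = response.rindex("{")
    end = response.rindex("}") + 1
    answer = json.loads(response[start:end])
    # Everything before the JSON is the audit trail
    reasoning = response[:start].strip()
    return reasoning, answer
```

Store the reasoning alongside the structured answer so reviewers can see why the model chose a label, not just which label it chose.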
LLM calls fail. Rate limits hit. Latency spikes. Your feature needs a fallback.
```python
import json
import logging

from openai import RateLimitError

logger = logging.getLogger(__name__)

async def summarize_with_fallback(text: str) -> str:
    try:
        return await call_llm(text, timeout=5.0)
    except (TimeoutError, RateLimitError):
        # Fallback: first 200 chars, cut at a word boundary
        return text[:200].rsplit(" ", 1)[0] + "..."
    except json.JSONDecodeError:
        logger.warning("LLM returned unparseable response")
        return "Summary unavailable"
```
Best practice: Every LLM call should have a timeout, a retry budget, and a non-LLM fallback.
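A retry budget can be as simple as a wrapper with jittered exponential backoff around the call. A sketch, assuming `fn` is the async LLM call (the wrapper name and defaults here are ours):

```python
import asyncio
import random

async def call_with_retries(fn, *args, retries: int = 2, base_delay: float = 0.5):
    """Retry an async call with jittered exponential backoff.

    Gives up and re-raises after `retries` failed retry attempts,
    so the caller's fallback path still runs.
    """
    for attempt in range(retries + 1):
        try:
            return await fn(*args)
        except Exception:
            if attempt == retries:
                raise  # budget exhausted; let the fallback handle it
            # Jitter spreads retries out so clients don't stampede together
            delay = base_delay * (2 ** attempt) * (0.5 + random.random())
            await asyncio.sleep(delay)
```

Keep the budget small: two or three attempts is usually the right trade-off between recovering from transient errors and adding latency on top of an already slow call.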
Instead of a 500-word system prompt explaining the format, give 2-3 examples:
```text
Convert the user message to a database query.

Example: "orders from last week" -> SELECT * FROM orders WHERE created_at > NOW() - INTERVAL '7 days'
Example: "top customers by revenue" -> SELECT customer_id, SUM(amount) AS revenue FROM orders GROUP BY customer_id ORDER BY revenue DESC LIMIT 10

User: {user_message}
```
This is more reliable than describing the syntax rules in prose.
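Keeping the examples as data rather than prose baked into a template also makes them easy to grow as you find new failure cases. A sketch of assembling the few-shot prompt above programmatically (`build_prompt` and `FEW_SHOT_EXAMPLES` are hypothetical names):

```python
# Each entry pairs a user message with the query we want the model to emit
FEW_SHOT_EXAMPLES = [
    ("orders from last week",
     "SELECT * FROM orders WHERE created_at > NOW() - INTERVAL '7 days'"),
    ("top customers by revenue",
     "SELECT customer_id, SUM(amount) AS revenue FROM orders "
     "GROUP BY customer_id ORDER BY revenue DESC LIMIT 10"),
]

def build_prompt(user_message: str) -> str:
    """Assemble the few-shot prompt from the example list."""
    lines = ["Convert the user message to a database query.", ""]
    for question, query in FEW_SHOT_EXAMPLES:
        lines.append(f'Example: "{question}" -> {query}')
    lines.append("")
    lines.append(f"User: {user_message}")
    return "\n".join(lines)
```

When a new phrasing confuses the model, adding one more pair to the list is usually cheaper and more reliable than lengthening the prose instructions.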
The models are impressive, but production reliability comes from everything around the model call.