A practical field manual for engineering teams who want AI features that survive real users, incidents, and budgets — not just demo day.
Most AI initiatives don't fail because the model is bad. They fail because the surrounding system is missing: no clear contracts, no evaluation, no governance, and no path from one clever notebook to a repeatable product capability.
An AI contract is a one-pager every feature must have before any prompt work begins. It defines:
- sanitized Markdown, max 4k chars).
- Latency: P95 ≤ 2.5s.
- Cost: ≤ $0.005 per request at forecast traffic.

A team shipping an “AI incident summary” feature skipped this step. They let the model “summarize any incident data” with no schema. Within weeks:
They rewrote the feature around a contract:
Task: Summarize a production incident for engineers.
Inputs: Root cause, impact, timeline (ISO timestamps), remediation, incident tags.
Output: JSON with fields: title, short_summary (≤ 240 chars), long_summary (≤ 800 chars), impact_level (LOW|MEDIUM|HIGH).
Quality: 90% of summaries accepted without edit by on-call engineer.
Latency: P95 ≤ 2s.
Cost: ≤ $0.003 per incident.
Breakage stopped overnight because downstream code finally knew what to expect.
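The output schema from such a contract can be enforced in code before anything downstream touches the model's response. A minimal sketch, using only the fields and limits from the contract above (the function name and error handling are illustrative, not the team's actual implementation):

```python
import json

ALLOWED_IMPACT = {"LOW", "MEDIUM", "HIGH"}

def validate_summary(raw: str) -> dict:
    """Parse model output and enforce the incident-summary contract, raising on violation."""
    data = json.loads(raw)
    required = {"title", "short_summary", "long_summary", "impact_level"}
    missing = required - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if len(data["short_summary"]) > 240:
        raise ValueError("short_summary exceeds 240 chars")
    if len(data["long_summary"]) > 800:
        raise ValueError("long_summary exceeds 800 chars")
    if data["impact_level"] not in ALLOWED_IMPACT:
        raise ValueError(f"invalid impact_level: {data['impact_level']}")
    return data
```

Downstream code calls the validator once at the boundary, so a surprise shape becomes a handled error instead of a crash three services away.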
Check the contract into the repo (for example, docs/ai/contracts/incident-summary.md) and require it in PRs for new AI features.

Prompts live in a vendor UI, people tweak them during incidents, and nobody knows why behavior changed last week.
Real-world example
A support team had a “reply draft” assistant. A well-meaning engineer changed the tone phrasing in the prompt from “professional and concise” to “friendly and conversational.” Conversion dipped on enterprise accounts — customers started complaining about “too casual” responses — but there was:
- no prompt file under version control (for example, prompts/support/reply-draft.prompt.md).
- no version label such as SUPPORT_REPLY_V3.

Keep a prompts/ folder with:
- *.prompt.md — raw prompt text.
- *.policy.md — safety/guardrail instructions.
- CHANGELOG.md — bullet list of meaningful changes.

For retrieval-augmented generation (RAG), prompt tuning on top of bad retrieval is like polishing a cracked lens.
A team shipped a RAG “runbook assistant” for on-call engineers. It hallucinated outdated mitigation steps. The root cause was not the model; it was retrieval.
After they fixed retrieval, hallucinations dropped by more than 40%, with no model change.
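One way to catch this class of failure before users do is a retrieval hit-rate eval over labeled queries. A minimal sketch, with a toy keyword retriever standing in for the real one (all names here are illustrative; a production retriever would score embedding similarity):

```python
def retrieve(query: str, docs: dict[str, str], k: int = 3) -> list[str]:
    """Rank doc ids by naive keyword overlap with the query (toy stand-in)."""
    words = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: -len(words & set(docs[d].lower().split())))
    return ranked[:k]

def hit_rate(labeled: list[tuple[str, str]], docs: dict[str, str], k: int = 3) -> float:
    """Fraction of labeled queries whose expected doc appears in the top-k results."""
    hits = sum(expected in retrieve(q, docs, k) for q, expected in labeled)
    return hits / len(labeled)
```

Run this on every change to chunking, embeddings, or the index; if hit rate drops, no amount of prompt tuning downstream will save you.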
Goal: your system never crashes because of AI output shape.
Goal: ensure the model is “good enough” for key tasks.
Goal: check end-to-end behavior in real user flows.
Example: for an “incident summary + Slack notification” flow, tests should verify:
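A sketch of what such an end-to-end test could look like; run_incident_flow and the Slack payload shape are hypothetical stand-ins for the real pipeline:

```python
def run_incident_flow(incident: dict) -> dict:
    """Stand-in for the real pipeline: summarize the incident, then build a Slack payload."""
    summary = {"title": incident["title"], "impact_level": "HIGH"}
    return {
        "summary": summary,
        "slack_payload": {"text": f"[{summary['impact_level']}] {summary['title']}"},
    }

def test_incident_flow_end_to_end():
    result = run_incident_flow({"title": "DB outage"})
    # The summary shape downstream code depends on is respected
    assert set(result["summary"]) >= {"title", "impact_level"}
    # The Slack message carries both the impact level and the title
    assert result["slack_payload"]["text"].startswith("[HIGH]")
    assert "DB outage" in result["slack_payload"]["text"]
```

The point is that the assertions target the user-visible flow, not the model's raw text.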
Accuracy alone is vanity if latency, cost, or policy risk are unacceptable.
A customer-facing “architecture review assistant” had great semantic scores, but its latency and cost were unacceptable.
Once they built a scorecard per release, the real trade-offs were exposed.
They then explicitly traded a slight drop in accuracy for 2× faster responses and 50% lower cost, which turned out to be the right business call.
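A per-release scorecard can be as simple as aggregating acceptance rate, P95 latency, and mean cost in one function. A sketch (the field names and the P95 index calculation are illustrative choices, not a standard):

```python
import statistics

def scorecard(samples: list[dict]) -> dict:
    """samples: one dict per eval request with keys accepted, latency_s, cost_usd."""
    latencies = sorted(s["latency_s"] for s in samples)
    p95_index = max(0, round(0.95 * len(latencies)) - 1)
    return {
        "accuracy": sum(s["accepted"] for s in samples) / len(samples),
        "latency_p95_s": latencies[p95_index],
        "cost_per_request_usd": statistics.mean(s["cost_usd"] for s in samples),
    }
```

Putting all three numbers side by side per release is what makes a trade like “slightly lower accuracy for 2× faster and 50% cheaper” an explicit decision instead of an accident.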
You don’t need a committee; you need clear ownership.
Keep a single ai/systems.yaml file listing:
- each system by ID (for example, incident_summary_v2).

This file becomes your source of truth during incidents, audits, and vendor changes.
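Such a file might look like the following; every field name and value below is an illustrative assumption, not a required schema:

```yaml
# ai/systems.yaml — illustrative shape
systems:
  - id: incident_summary_v2
    owner: platform-oncall
    model: <provider/model name>
    prompt: prompts/incidents/summary.prompt.md   # hypothetical path
    contract: docs/ai/contracts/incident-summary.md
```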
Blindly using the “best” model for everything is wasteful and risky.
Implement routing via a config file (for example, ai/routing.json) rather than conditionals scattered through code. That makes it easy to change behavior without redeploying everything.
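A routing config in that spirit might look like this; the route keys and model names are placeholders:

```json
{
  "routes": {
    "incident_summary_v2": { "model": "small-fast-model", "fallback_model": "larger-model" },
    "support_reply_draft": { "model": "larger-model" }
  }
}
```

Code then looks up its route by system ID, and switching a task to a cheaper model is a config change plus a scorecard run, not a redeploy.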
In an internal incident tool, adding a small “What data was used?” toggle reduced complaints about hallucinations. Engineers could see exactly which runbooks and incidents the assistant used.
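The underlying pattern is to return provenance alongside the answer so the UI can render that toggle. A sketch; call_model is a placeholder for the real LLM call, and the chunk shape is an assumption:

```python
def call_model(question: str, context: str) -> str:
    """Placeholder for the actual LLM call."""
    return f"(model answer to {question!r} using {len(context)} chars of context)"

def answer_with_sources(question: str, retrieved: list[dict]) -> dict:
    """retrieved: chunks with 'doc' and 'text' keys from the retriever."""
    context = "\n".join(chunk["text"] for chunk in retrieved)
    answer = call_model(question, context)
    # Deduplicated, stable list of source documents for the "What data was used?" UI
    return {"answer": answer, "sources": sorted({chunk["doc"] for chunk in retrieved})}
```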
Assume:
A team added a “deploy with AI” helper. They blocked direct shell execution and forced every model-proposed action through an extra review step.
This added 1–2 seconds but avoided an entire class of “the model misread the ticket” incidents.
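One common shape for this kind of guard, sketched below as an assumption about the pattern rather than that team's actual code: the model proposes a structured action, and an allowlist plus explicit confirmation gate execution, so no shell strings pass through.

```python
ALLOWED_ACTIONS = {"deploy", "rollback"}

def execute_proposal(proposal: dict, confirmed: bool) -> str:
    """Run a model-proposed action only if it is allowlisted and human-confirmed."""
    action = proposal.get("action")
    if action not in ALLOWED_ACTIONS:
        return f"rejected: unknown action {action!r}"
    if not confirmed:
        return f"pending: {action} awaits human confirmation"
    # Dispatch to a typed internal API, never a shell command built from model text.
    return f"executed: {action} on {proposal.get('service', 'unknown')}"
```

A model that misreads the ticket can at worst propose a rejected or pending action; it cannot run anything on its own.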
Keep runbooks in ai/runbooks/*.md for recurring operational scenarios.
Over time, this turns AI work from “clever prompts” into a disciplined engineering practice — the real competitive advantage in 2026.
Maintain ai/systems.yaml and your routing config.

If you do only this, you’ll already be ahead of most teams shipping AI in production.