Most AI initiatives don't fail because the model is bad. They fail because the surrounding system is missing: no clear contracts, no evaluation, no governance, and no path from one clever notebook to a repeatable product capability.
This guide is a field manual for engineering teams who want AI features that survive real users, real incidents, and real budgets — not just demo day.
An AI contract is a one-pager every feature must have before any prompt work begins. It defines:
- The task, its inputs, and its output format (for example, sanitized Markdown, max 4k chars).
- A quality bar.
- A latency budget: P95 ≤ 2.5s.
- A cost budget: ≤ $0.005 per request at forecast traffic.

A team shipping an “AI incident summary” feature skipped this step. They let the model “summarize any incident data” with no schema. Within weeks, downstream consumers were breaking on output shapes they could not predict.
They rewrote the feature around a contract:
- Task: Summarize a production incident for engineers.
- Inputs: Root cause, impact, timeline (ISO timestamps), remediation, incident tags.
- Output: JSON with fields: `title`, `short_summary` (≤ 240 chars), `long_summary` (≤ 800 chars), `impact_level` (LOW|MEDIUM|HIGH).
- Quality: 90% of summaries accepted without edit by the on-call engineer.
- Latency: P95 ≤ 2s.
- Cost: ≤ $0.003 per incident.
Breakage stopped overnight because downstream code finally knew what to expect.
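A contract like this can be enforced at the boundary where model output enters your system. A minimal sketch using only the standard library; the `validate_summary` helper name is an illustration, not part of the contract above:

```python
import json

ALLOWED_IMPACT = {"LOW", "MEDIUM", "HIGH"}

def validate_summary(raw: str) -> dict:
    """Parse and validate a model response against the incident-summary contract."""
    data = json.loads(raw)  # raises ValueError if the model returned non-JSON
    required = {"title", "short_summary", "long_summary", "impact_level"}
    missing = required - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if len(data["short_summary"]) > 240:
        raise ValueError("short_summary exceeds 240 chars")
    if len(data["long_summary"]) > 800:
        raise ValueError("long_summary exceeds 800 chars")
    if data["impact_level"] not in ALLOWED_IMPACT:
        raise ValueError(f"bad impact_level: {data['impact_level']}")
    return data
```

Anything that fails here never reaches downstream code, which is exactly what makes the contract enforceable rather than aspirational.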
Store each contract in the repo (for example, `docs/ai/contracts/incident-summary.md`) and require it in PRs for new AI features.

The next failure mode: prompts live in a vendor UI, people tweak them during incidents, and nobody knows why behavior changed last week.
Real-world example
A support team had a “reply draft” assistant. A well-meaning engineer changed the tone phrasing in the prompt from “professional and concise” to “friendly and conversational.” Conversion dipped on enterprise accounts — customers started complaining about “too casual” responses — but there was no version history and no record of who changed what, or why.

The fix: treat prompts as code. Keep each prompt in the repo (for example, `prompts/support/reply-draft.prompt.md`), reference it by a versioned identifier such as `SUPPORT_REPLY_V3`, and maintain a `prompts/` folder with:
- `*.prompt.md` — raw prompt text.
- `*.policy.md` — safety/guardrail instructions.
- `CHANGELOG.md` — a bullet list of meaningful changes.

For retrieval-augmented generation (RAG), prompt tuning on top of bad retrieval is like polishing a cracked lens.
A team shipped a RAG “runbook assistant” for on-call engineers. It hallucinated outdated mitigation steps. The root cause was not the model; it was the retrieval layer, which kept surfacing stale runbook content. After they fixed retrieval, hallucinations dropped by more than 40%, with no model change.
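One common source of stale answers is retrieval that ignores document freshness. A hedged sketch of filtering retrieved runbook chunks by last-updated timestamp before they reach the prompt — the `Doc` shape and the 180-day cutoff are assumptions for illustration, not the team's actual fix:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class Doc:
    text: str
    score: float          # retrieval similarity score
    updated_at: datetime  # when the runbook was last edited

def fresh_top_k(docs: list[Doc], k: int = 3, max_age_days: int = 180) -> list[Doc]:
    """Drop stale chunks, then keep the k highest-scoring remainders."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    fresh = [d for d in docs if d.updated_at >= cutoff]
    return sorted(fresh, key=lambda d: d.score, reverse=True)[:k]
```

Note that a stale chunk is dropped even when it scores highest on similarity — which is precisely how outdated mitigation steps win otherwise.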
Test at three layers:

1. Shape tests. Goal: your system never crashes because of AI output shape.
2. Quality evals. Goal: ensure the model is “good enough” for key tasks.
3. Integration tests. Goal: check end-to-end behavior in real user flows.
Example: for an “incident summary + Slack notification” flow, tests should verify that the model output parses against the summary schema, that field limits (such as the 240-char short summary) hold, and that the Slack message renders every field it expects.
Accuracy alone is vanity if latency, cost, or policy risk are unacceptable.
A customer-facing “architecture review assistant” had great semantic scores, but its latency and per-request cost made it painful in practice. Once they built a per-release scorecard (accuracy, latency, cost, and policy flags side by side), the trade-offs finally became visible. They then explicitly traded a slight drop in accuracy for 2× faster responses and 50% lower cost, which turned out to be the right business call.
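A scorecard works best as a small CI gate that fails the release when any budget is blown, rather than a dashboard someone remembers to check. A sketch with illustrative thresholds (the numbers below are assumptions, not the team's actual budgets):

```python
from dataclasses import dataclass

@dataclass
class Scorecard:
    accuracy: float       # fraction of eval cases passing
    p95_latency_s: float  # seconds
    cost_per_req: float   # dollars

def release_gate(card: Scorecard,
                 min_accuracy: float = 0.90,
                 max_p95_s: float = 2.5,
                 max_cost: float = 0.005) -> list[str]:
    """Return a list of budget violations; an empty list means the release passes."""
    failures = []
    if card.accuracy < min_accuracy:
        failures.append(f"accuracy {card.accuracy:.2f} < {min_accuracy}")
    if card.p95_latency_s > max_p95_s:
        failures.append(f"p95 {card.p95_latency_s}s > {max_p95_s}s")
    if card.cost_per_req > max_cost:
        failures.append(f"cost ${card.cost_per_req} > ${max_cost}")
    return failures
```

Because the gate reports all violations at once, a release that regresses on latency and cost shows both, which is what makes deliberate trade-offs possible.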
You don’t need a committee; you need clear ownership.
Keep a single ai/systems.yaml file listing:
- Every AI system in production, keyed by a stable identifier (for example, `incident_summary_v2`), along with its owner.

This file becomes your source of truth during incidents, audits, and vendor changes.
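The fields below are illustrative assumptions, not a prescribed schema; a minimal `ai/systems.yaml` might look like:

```yaml
# ai/systems.yaml — one entry per AI system in production
systems:
  - id: incident_summary_v2
    owner: platform-oncall          # team accountable for this system
    prompt_version: INCIDENT_SUMMARY_V2
    contract: docs/ai/contracts/incident-summary.md
    budgets:
      p95_latency_s: 2.0
      cost_per_request_usd: 0.003
```

The point is less the exact fields than that there is exactly one file to open when something breaks.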
Blindly using the “best” model for everything is wasteful and risky.
Implement routing via a config file (for example, ai/routing.json) rather than conditionals scattered through code. That makes it easy to change behavior without redeploying everything.
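A sketch of what config-driven routing might look like, assuming a hypothetical `ai/routing.json` that maps task names to model tiers (the file contents and tier names are illustrative):

```python
import json
from pathlib import Path

# Example ai/routing.json contents (illustrative):
# {"incident_summary": "small-fast",
#  "architecture_review": "large-accurate",
#  "default": "small-fast"}

def pick_model(task: str, config_path: str = "ai/routing.json") -> str:
    """Resolve a task name to a model tier from the routing config file."""
    routes = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return routes.get(task, routes["default"])
```

Shifting a task to a cheaper or stronger model is then a one-line config change reviewed in a PR, not a code hunt across services.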
In an internal incident tool, adding a small “What data was used?” toggle reduced complaints about hallucinations. Engineers could see exactly which runbooks and incidents the assistant used.
Assume that model output is untrusted input, and that any action the model proposes can be wrong.
A team added a “deploy with AI” helper. They blocked direct shell execution and forced every AI-proposed command through an explicit human confirmation step.
This added 1–2 seconds but avoided an entire class of “the model misread the ticket” incidents.
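The pattern — never let the model execute directly, always insert a human gate — can be sketched as follows. The `ProposedAction`/`confirm` split is an assumption about such a design, not the team's actual code:

```python
from dataclasses import dataclass

@dataclass
class ProposedAction:
    command: str  # what the model wants to run, shown verbatim to a human
    reason: str   # the model's justification, also shown for review

def run_deploy_step(command: str) -> None:
    # Placeholder for real deploy tooling; the model never calls this directly.
    print(f"executing: {command}")

def execute_with_confirmation(action: ProposedAction, confirm) -> bool:
    """Run an AI-proposed deploy step only after explicit human approval.

    `confirm` is a callable (a CLI prompt, a Slack button handler) that
    returns True only on an explicit yes. The sanctioned execution path
    is unreachable without it.
    """
    if not confirm(action):
        return False  # rejected: nothing executed
    run_deploy_step(action.command)
    return True
```

Showing the command and the model's reasoning verbatim is what lets the human catch “the model misread the ticket” before anything runs.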
Keep `ai/runbooks/*.md` for the AI failure modes you expect to recur (for example, a provider degradation or a sudden cost spike).
Over time, this turns AI work from “clever prompts” into a disciplined engineering practice — the real competitive advantage in 2026.
To recap the minimum bar: write a one-page contract per feature, version prompts in the repo, test output shape before quality, score each release on accuracy, latency, and cost, and maintain `ai/systems.yaml` and routing config.

If you do only this, you’ll already be ahead of most teams shipping AI in production.