devops/ness
Featured Article

Model Fallback Policies for Customer-Facing AI: The Routing Rules That Kept SLA Intact

A real-world model fallback guide for customer-facing AI systems, covering how one team preserved response quality and support SLAs during a partial provider degradation.

AI, LLM, GPT, Monitoring
Kiril Urbonas, DevOps Engineer and AI Enthusiast | Mar 27, 2026

Topics

Monitoring (287), Terraform (210), AWS (171), Kubernetes (126), Python (114), Security (112), CI/CD (107), LLM (101), Ansible (98), Linux (98)

Latest Articles

Artifact Promotion Instead of Rebuilds: The Release Control Pattern That Stopped Drift
yesterday

A practical artifact promotion guide for CI/CD teams that were tired of hearing 'it passed in staging' after production behaved differently because the release was rebuilt.

Kiril Urbonas · 4 min read
RDS Restore Drills for Busy Teams: The Recovery Workflow That Surfaced Real Gaps
2 days ago

A hands-on RDS restore drill guide for small cloud teams that thought backups were covered until a timed restore test exposed missing steps, DNS confusion, and stale credentials.

Kiril Urbonas · 4 min read

DevOpsNess

Practical AI, DevOps, Cloud, and Linux guidance for engineering teams

Weekly deep dives, implementation patterns, and reliability-focused playbooks.


A practical blog covering AI, cloud, DevOps, and modern technology for engineering teams.


Systemd Drop-In Overrides for Vendor Services: The Supportable Linux Ops Pattern
3 days ago

A practical systemd drop-in guide built from a real operations problem: vendor unit files kept changing, but the team still needed consistent restart, environment, and logging behavior.

Kiril Urbonas · 4 min read
Terraform Module Version Pinning: How One Platform Team Stopped Surprise Breakage
4 days ago

A real-world Terraform module version pinning guide for platform teams that want safer upgrades, clearer ownership, and fewer broken pipelines after shared module releases.

Kiril Urbonas · 4 min read
Embedding Model Upgrades Without Search Chaos: A Safer RAG Rollout Pattern
5 days ago

A practical embedding model upgrade guide for RAG systems, built from a real support-search migration that initially reduced answer quality instead of improving it.

Kiril Urbonas · 4 min read
Multi-Cluster Traffic Routing Strategies: A Pragmatic Rollout Pattern for Growing SaaS Teams
6 days ago

A real-world multi-cluster traffic routing guide for SaaS teams that have outgrown a single Kubernetes cluster and need safer rollout control without a service-mesh science project.

Kiril Urbonas · 4 min read
Terraform State Isolation by Environment: How We Stopped One Change from Hitting Prod
last week

A practical Terraform state isolation guide built from a real environment-mixing incident, with patterns for safer backends, clearer ownership, and lower blast radius.

Kiril Urbonas · 3 min read
Prompt Versioning and Regression Testing: How Teams Avoid Silent AI Regressions
last week

A real-world guide to prompt versioning and regression testing for production AI features, focused on preventing the subtle changes that hurt quality long before anyone notices.

Kiril Urbonas · 3 min read
Systemd Service Reliability Patterns: What We Changed After Repeated Restart Loops
last week

A practical systemd reliability guide for Linux services, built around repeated restart-loop incidents and the unit-file patterns that finally made those services boring.

Kiril Urbonas · 3 min read
Blue-Green Deployment Guardrails in Kubernetes: Lessons from a Failed Friday Rollout
last week

A Kubernetes blue-green deployment guide built around a real rollout failure, showing the guardrails that matter when traffic shifting, health checks, and rollback timing all interact.

Kiril Urbonas · 3 min read
Cloud Disaster Recovery Runbook Design: How Small Teams Rehearse Multi-Region Failover
last week

A practical disaster recovery runbook guide for small cloud teams that need realistic failover steps, clear ownership, and repeatable rehearsals instead of shelfware documents.

Kiril Urbonas · 4 min read
RAG Retrieval Quality Evaluation: The Checks We Added After Bad Answers Reached Production
last week

A search-friendly guide to RAG retrieval quality evaluation, based on the moment one production assistant started citing stale documents and the team had to prove what 'good retrieval' meant.

Kiril Urbonas · 3 min read

© 2026 DevOpsNess. By Kiril Urbonas.
