devops/ness
Featured Article

Monitoring That Actually Helps On-Call: Alerts, Dashboards, and Runbooks

How we went from 200 alerts per week (most ignored) to 15 actionable alerts with clear runbooks and useful dashboards.

Infrastructure, Monitoring, Kubernetes, Terraform

DevOpsNess

Practical AI, DevOps, Cloud, and Linux guidance for engineering teams

Weekly deep dives, implementation patterns, and reliability-focused playbooks.




Kiril Urbonas, DevOps Engineer and AI Enthusiast | Apr 4, 2026

Topics

Monitoring (291), Terraform (214), AWS (174), Kubernetes (129), Python (115), Security (112), CI/CD (111), LLM (102), Ansible (99), Linux (99)

Latest Articles

Secrets Management in Practice: From .env Files to Vault

How we migrated from .env files checked into repos to a proper secrets management workflow with HashiCorp Vault and CI/CD integration.

Kiril Urbonas · 3 min read · yesterday

Incident Postmortems That Actually Prevent Repeat Failures

How to write postmortems that lead to real improvements, not just documentation theater. Includes a template and real examples.

Kiril Urbonas · 3 min read · 2 days ago

Terraform Modules Done Right: Lessons from Managing 50+ Services

Practical patterns for Terraform modules at scale: versioning, composition, testing, and avoiding the monolith trap.

Kiril Urbonas · 3 min read · 3 days ago

Linux Performance Troubleshooting: A Real Incident Walkthrough

Step-by-step debugging of a production Linux server hitting 100% CPU. From top to perf to the actual fix.

Kiril Urbonas · 2 min read · 4 days ago

Prompt Engineering Patterns That Actually Work in Production

Battle-tested prompt patterns from running LLM features in production: structured output, chain-of-thought, and graceful failure handling.

Kiril Urbonas · 3 min read · 5 days ago

AWS Cost Audit: 7 Things We Found Wasting Money Every Month

A real cost audit uncovered idle load balancers, oversized RDS instances, and forgotten snapshots. Here's what we found and how we fixed each one.

Kiril Urbonas · 2 min read · 6 days ago

How We Cut Our Docker Image Size by 80% and Why It Matters

A real walkthrough of shrinking bloated Docker images from 1.2GB to 240MB using multi-stage builds, Alpine, and dependency auditing.

Kiril Urbonas · 2 min read · last week

Model Fallback Policies for Customer-Facing AI: The Routing Rules That Kept SLA Intact

A real-world model fallback guide for customer-facing AI systems, covering how one team preserved response quality and support SLAs during a partial provider degradation.

Kiril Urbonas · 4 min read · last week

Artifact Promotion Instead of Rebuilds: The Release Control Pattern That Stopped Drift

A practical artifact promotion guide for CI/CD teams that were tired of hearing 'it passed in staging' after production behaved differently because the release was rebuilt.

Kiril Urbonas · 4 min read · last week

RDS Restore Drills for Busy Teams: The Recovery Workflow That Surfaced Real Gaps

A hands-on RDS restore drill guide for small cloud teams that thought backups were covered until a timed restore test exposed missing steps, DNS confusion, and stale credentials.

Kiril Urbonas · 4 min read · last week

Systemd Drop-In Overrides for Vendor Services: The Supportable Linux Ops Pattern

A practical systemd drop-in guide built from a real operations problem: vendor unit files kept changing, but the team still needed consistent restart, environment, and logging behavior.

Kiril Urbonas · 4 min read · last week

Terraform Module Version Pinning: How One Platform Team Stopped Surprise Breakage

A real-world Terraform module version pinning guide for platform teams that want safer upgrades, clearer ownership, and fewer broken pipelines after shared module releases.

Kiril Urbonas · 4 min read · last week

Page 1 of 46 · 543 posts

© 2026 DevOpsNess. By Kiril Urbonas.
