Blog

Practical articles on AI, DevOps, Cloud, Linux, and infrastructure engineering.

Pre-Commit Hooks That Saved Our Repo: 7 Real Examples

last week

Every hook on this list caught a bug or a security issue in the last twelve months. The configs are short. The savings have been considerable.

EKS Auto Mode: What Worked, What Broke in Our Migration

last week

We moved a 60-node production EKS cluster to Auto Mode. Some pain points evaporated, others got harder. The cost picture is more nuanced than the marketing suggests.

Self-Hosted LLMs vs OpenAI API: A Cost-vs-Latency Analysis After 6 Months

last week

We ran the same workload on both for half a year. The break-even point isn't where most blog posts say it is — and the latency story has more nuance than throughput-per-dollar charts admit.

OpenTelemetry Collector Pipelines: Real Configs That Survived Production

last week

We've been running the OTel Collector at the edge of every cluster for 18 months. The config patterns that lasted, the ones we ripped out, and a few processors that quietly saved us money.

Blue/Green Deploys for Stateful Services: A Postgres Cutover Story

last week

Blue/green is easy for stateless services. We did it for our primary Postgres cluster with 3.2TB of data and ~8k connections. Here's exactly how — and what almost went wrong.

systemd Timers vs Cron: When We Switched and What We Learned

2 weeks ago

We migrated 47 cron jobs to systemd timers across our fleet. The mechanical conversion was easy. The interesting parts were the bugs we found that cron had been hiding.

Zero Trust on AWS: Lessons From Implementing IAM Identity Center

2 weeks ago

We replaced 14 long-lived IAM users with SSO + temporary credentials. The migration plan, the gotchas, and the policies we now enforce.

Embedding Quality in RAG: How We Cut Hallucinations by 60%

2 weeks ago

Six months running RAG in production taught us that the retrieval step matters far more than the model. Concrete techniques that moved the needle, with before/after numbers.

Database Migrations Without Downtime: Patterns From Three Real Cutovers

2 weeks ago

How we shipped three schema migrations with zero customer impact. Expand-then-contract, dual-writes, and the rollback plan we never had to use — but tested anyway.

Monitoring That Actually Helps On-Call: Alerts, Dashboards, and Runbooks

2 weeks ago

How we went from 200 alerts per week (most ignored) to 15 actionable alerts with clear runbooks and useful dashboards.

Secrets Management in Practice: From .env Files to Vault

2 weeks ago

How we migrated from .env files checked into repos to a proper secrets management workflow with HashiCorp Vault and CI/CD integration.

Incident Postmortems That Actually Prevent Repeat Failures

2 weeks ago

How to write postmortems that lead to real improvements, not just documentation theater. Includes a template and real examples.

Page 1 of 47 · 553 posts