A real cost audit uncovered idle load balancers, oversized RDS instances, and forgotten snapshots. Here's what we found and how we fixed each one.
After our AWS bill crossed $18,000/month for a 15-person startup, we did a proper audit. We found $6,200 in monthly waste. Here's every item.
Three ALBs were still running from decommissioned staging environments. Each costs ~$16/month base plus LCU charges.
Fix: We now tag every ALB in Terraform with its owning team and a TTL. A weekly Lambda deletes anything past its TTL that has zero healthy targets.
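A minimal sketch of the expiry decision that weekly Lambda makes (the function name and the ISO-date `ttl` tag format here are assumptions; the real handler also confirms zero healthy targets via the ELBv2 API before deleting):

```python
from datetime import datetime, timezone

def is_expired(tags: dict, now: datetime) -> bool:
    """Return True if the load balancer's TTL tag is in the past.

    Hypothetical tag format: each ALB carries "team" and "ttl" tags,
    where "ttl" is an ISO-8601 date such as "2024-06-30".
    """
    ttl = tags.get("ttl")
    if ttl is None:
        return False  # untagged ALBs are never auto-deleted
    expiry = datetime.fromisoformat(ttl).replace(tzinfo=timezone.utc)
    return expiry < now

# The actual Lambda then calls boto3's elbv2 delete_load_balancer
# only for ALBs that are expired AND report zero healthy targets.
```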
Our production database was on db.r6g.2xlarge. CloudWatch showed average CPU at 12% and memory at 35%.
Fix: Downgraded to db.r6g.large during a maintenance window. Set up a CloudWatch alarm for CPU > 70% so we'll know when to scale back up.
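One way to express that alarm in Terraform (the resource names and the SNS topic are placeholders; the threshold matches the 70% CPU trigger above):

```hcl
resource "aws_cloudwatch_metric_alarm" "rds_cpu_high" {
  alarm_name          = "rds-cpu-above-70"
  namespace           = "AWS/RDS"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 3
  threshold           = 70
  comparison_operator = "GreaterThanThreshold"

  dimensions = {
    DBInstanceIdentifier = aws_db_instance.main.id
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}
```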
14 EBS volumes were sitting with status "available"—leftovers from terminated EC2 instances.
Fix: Scripted a check:
aws ec2 describe-volumes \
--filters Name=status,Values=available \
--query 'Volumes[*].{ID:VolumeId,Size:Size,Created:CreateTime}' \
--output table
Any volume older than 30 days gets a final snapshot, then gets deleted.
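The 30-day cutoff can be sketched in Python over rows shaped like the `describe-volumes` query output above (the volume IDs below are made up):

```python
from datetime import datetime, timedelta, timezone

def stale_volumes(volumes: list[dict], now: datetime, days: int = 30) -> list[str]:
    """Return IDs of volumes created more than `days` ago."""
    cutoff = now - timedelta(days=days)
    return [
        v["ID"] for v in volumes
        if datetime.fromisoformat(v["Created"]) < cutoff
    ]

# Rows matching the {ID, Size, Created} shape of the CLI query
volumes = [
    {"ID": "vol-0abc", "Size": 100, "Created": "2024-01-15T00:00:00+00:00"},
    {"ID": "vol-0def", "Size": 50,  "Created": "2024-06-20T00:00:00+00:00"},
]
now = datetime(2024, 7, 1, tzinfo=timezone.utc)
print(stale_volumes(volumes, now))  # only vol-0abc is past the cutoff
```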
We had 2,400 EBS snapshots going back 3 years. Most were from AMIs we no longer use.
Fix: Implemented AWS Data Lifecycle Manager with a 90-day retention policy.
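A Terraform sketch of that policy (the IAM role reference and the `Backup` target tag are assumptions; retaining 90 daily snapshots per volume approximates a 90-day window):

```hcl
resource "aws_dlm_lifecycle_policy" "ebs_snapshots" {
  description        = "Daily EBS snapshots, 90-day retention"
  execution_role_arn = aws_iam_role.dlm.arn
  state              = "ENABLED"

  policy_details {
    resource_types = ["VOLUME"]

    target_tags = {
      Backup = "true"
    }

    schedule {
      name      = "daily"
      copy_tags = true

      create_rule {
        interval      = 24
        interval_unit = "HOURS"
        times         = ["03:00"]
      }

      retain_rule {
        count = 90
      }
    }
  }
}
```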
Our NAT Gateway was processing 800GB/month. Much of it was S3 traffic from private subnets.
Fix: Added a VPC Gateway Endpoint for S3. Free, and it cut NAT traffic by 60%.
resource "aws_vpc_endpoint" "s3" {
  vpc_id          = aws_vpc.main.id
  service_name    = "com.amazonaws.us-east-1.s3"
  route_table_ids = [aws_route_table.private.id]
}
Every Lambda was set to 1024MB by default. AWS Lambda Power Tuning showed most functions needed only 256MB.
Fix: Ran Power Tuning on our top 10 functions and right-sized them.
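Back-of-the-envelope math on why this matters: Lambda bills per GB-second, so at the same duration a 256MB function costs a quarter of a 1024MB one. The rate below is the published us-east-1 x86 price at the time of writing, and the invocation counts are illustrative; in practice lower memory can lengthen duration slightly, which is exactly the trade-off Power Tuning measures:

```python
PRICE_PER_GB_SECOND = 0.0000166667  # us-east-1 x86 rate; check current pricing

def monthly_cost(memory_mb: int, avg_duration_s: float, invocations: int) -> float:
    """Duration cost for a month of invocations.

    Excludes the per-request charge and the free tier.
    """
    gb_seconds = (memory_mb / 1024) * avg_duration_s * invocations
    return gb_seconds * PRICE_PER_GB_SECOND

# Illustrative workload: 5M invocations/month at 200ms average duration
before = monthly_cost(1024, 0.2, 5_000_000)
after = monthly_cost(256, 0.2, 5_000_000)
print(round(before, 2), round(after, 2))  # the 256MB config is 4x cheaper
```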
We were paying on-demand for 4 EC2 instances that had been running for 2 years.
Fix: Purchased 1-year no-upfront reserved instances for predictable workloads.
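The break-even logic is simple: for instances you know will run all month, the savings are just the hourly discount times hours times instance count. The rates below are illustrative placeholders, not real pricing; look up your instance type's actual on-demand and RI rates:

```python
HOURS_PER_MONTH = 730  # AWS billing convention

def ri_monthly_savings(on_demand_hourly: float, ri_hourly: float, instances: int) -> float:
    """Monthly savings from covering steady-state instances with RIs."""
    return (on_demand_hourly - ri_hourly) * HOURS_PER_MONTH * instances

# Illustrative rates only -- substitute your instance type's pricing.
print(round(ri_monthly_savings(0.096, 0.060, 4), 2))
```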
The $6,200/month we saved required about 8 hours of work. That's an annualized return of $74,400 for one day of effort.