We ran the same workload on both for half a year. The break-even point isn't where most blog posts say it is — and the latency story has more nuance than throughput-per-dollar charts admit.
For the last six months we've run the same workload — internal-facing assistant for ~2,400 employees — against both OpenAI's API and a self-hosted Llama 3 70B Instruct deployment. Not as a science fair, but as production. Here are the numbers, and the calls we'd make differently next time.
- gpt-4o-mini for 90% of traffic
- gpt-4o for 10% (escalation queries)
- 4× g6e.12xlarge (4× L40S each = 16 GPUs total) on EKS
- Meta-Llama-3-70B-Instruct quantized to AWQ-INT4

| Model | Input ($/M tokens) | Output ($/M tokens) |
|---|---|---|
| gpt-4o-mini | $0.15 | $0.60 |
| gpt-4o (10% escalations) | $2.50 | $10.00 |
Effective blended rate at our mix: $0.42/M input + $1.74/M output.
Daily token volume: ~532M input, ~72M output.

Daily API cost: 532 × 0.42 + 72 × 1.74 ≈ $348/day → ~$10,400/mo
After 22% cache hit rate: ~$8,100/mo.
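The arithmetic above is easy to keep as a living script. A minimal sketch using the effective blended rates from our mix (all figures copied from the text, not fetched from any pricing API):

```python
# Cost model using the effective blended rates above ($/M tokens).
blended_in, blended_out = 0.42, 1.74
daily_in_mtok, daily_out_mtok = 532, 72       # millions of tokens per day

daily_cost = daily_in_mtok * blended_in + daily_out_mtok * blended_out  # 348.72
monthly = daily_cost * 30                      # ~$10.5k/mo
after_cache = monthly * (1 - 0.22)             # 22% cache hit rate -> ~$8.1k/mo
```

Re-running this whenever the traffic mix or the price sheet changes is how we keep the break-even comparison honest.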
4× g6e.12xlarge, reserved 1-year, no upfront: $2,815/mo each = $11,260/mo in compute.

Total self-hosted: ~$15,040/mo.
After 22% cache hit rate the savings on compute are zero because the GPUs are reserved 24/7. Cache hits just give us idle headroom for traffic spikes.
For us, yes: the API came out roughly $7k/mo cheaper ($8,100 vs $15,040). But the math flips quickly once GPU utilization rises. We now run a second internal workload (a PR review assistant) on the same cluster; effective GPU utilization went from 41% to 78%, and cost-per-request for the combined load is back below OpenAI's.
Throughput-per-dollar is the headline number every comparison post focuses on. The latency picture is messier and arguably more important for user experience.
Time to first token (TTFT):

| Stack | p50 | p95 | p99 |
|---|---|---|---|
| OpenAI gpt-4o-mini | 380ms | 720ms | 1.4s |
| Self-hosted (idle batch) | 95ms | 180ms | 290ms |
| Self-hosted (full batch) | 240ms | 480ms | 870ms |
Self-hosted was ~3× faster on TTFT because we control the network hop and the batch scheduling. For chat UX, TTFT is what users feel as "is it working?"
Generation speed (tokens/second):

| Stack | p50 | p95 |
|---|---|---|
| OpenAI gpt-4o-mini | 92 t/s | 41 t/s |
| Self-hosted (idle batch) | 78 t/s | 52 t/s |
| Self-hosted (full batch) | 38 t/s | 19 t/s |
OpenAI was generally faster per token but with more variance. Self-hosted under load was slower per token but predictable.
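Both tables come from raw per-request samples. A minimal stdlib sketch of the percentile computation (the sampling and export pipeline itself is out of scope here):

```python
import statistics

def percentiles(samples_ms):
    """p50/p95/p99 from raw latency samples (e.g. TTFT in milliseconds)."""
    # quantiles(n=100) returns 99 cut points; index i holds the (i+1)th percentile.
    qs = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}
```

The `method="inclusive"` choice treats the samples as the whole population rather than a sample of one, which matters less as the sample count grows.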
This is where self-hosted really shone. p99 on OpenAI varied across the day; we'd see occasional 8–15 second responses that we couldn't explain or escalate. Self-hosted p99 was 870ms and rock-stable because the queue was ours.
For a customer-facing product where you SLA on p99, this matters a lot.
The $3,500/mo "engineer time" line above is not a guess; we tracked it. Six months in, it averages roughly 15 hours/month at $230/hr fully loaded, about $3,450/mo.
We were briefly considering self-hosting for a single workload that was projected to be our main use case. It wasn't. It became one of three. If we'd committed GPU capacity to a single workload we'd have wasted most of it.
Rule we apply now: don't reserve GPUs unless you have at least two workloads that can share them, and a third one in your roadmap.
Months 1–3 we burned ~$8k/mo on the API. That bill funded the eval framework and gave us real production traffic data. With that data the GPU sizing decision was straightforward.
If we'd jumped to self-hosting in month 1 we'd have built for the wrong shape of workload.
Our internal eval set caught two regressions on the self-hosted side that would have shipped to users otherwise (prompt caching bug and a tokenizer mismatch). We run the eval against both stacks every deploy.
```python
# Sketch of the eval harness; grade() and summarize() are project helpers.
def run_eval(stack, suite):
    results = []
    for prompt in suite:
        completion = stack.complete(prompt.input)
        score = grade(prompt.expected, completion, llm_judge="gpt-4o")
        results.append({"prompt_id": prompt.id, "score": score})
    return summarize(results)
```
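The sketch leans on project helpers. A self-contained toy version, with a stubbed grader and a fake stack standing in for the OpenAI and vLLM clients, shows the shape of the per-deploy check against both stacks (every name here is a stand-in, not our production code):

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Prompt:
    id: str
    input: str
    expected: str

class FakeStack:
    """Stand-in for an OpenAI or vLLM client; complete() returns a string."""
    def __init__(self, reply):
        self.reply = reply
    def complete(self, text):
        return self.reply

def grade(expected, completion):
    # Toy exact-match grader; the real one calls an LLM judge.
    return 1.0 if expected == completion else 0.0

def summarize(results):
    return {"n": len(results), "mean_score": mean(r["score"] for r in results)}

def run_eval(stack, suite):
    results = [{"prompt_id": p.id,
                "score": grade(p.expected, stack.complete(p.input))}
               for p in suite]
    return summarize(results)

# Per-deploy check: same suite against both stacks, then compare summaries.
suite = [Prompt("p1", "ping", "pong"), Prompt("p2", "hi", "yo")]
api_summary = run_eval(FakeStack("pong"), suite)
local_summary = run_eval(FakeStack("yo"), suite)
```

A regression on either stack shows up as a drop in its `mean_score` relative to the last deploy, which is exactly how the two self-hosted regressions were caught.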
Semantic cache hit rate of 22% pays for itself either way. It's a no-brainer.
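For context, a semantic cache is a nearest-neighbor lookup over query embeddings: if a new query is close enough to a cached one, return the cached completion and skip the model call. A toy in-memory sketch, where a bag-of-words cosine stands in for a real embedding model and the 0.9 threshold is illustrative, not our production value:

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words vector; production would use a real embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.9):
        self.entries = []            # (embedding, completion) pairs
        self.threshold = threshold

    def get(self, query):
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]           # cache hit: skip the model call entirely
        return None                  # cache miss: caller goes to the model

    def put(self, query, completion):
        self.entries.append((embed(query), completion))
```

The threshold is the whole game: too loose and you serve stale or wrong answers, too tight and the hit rate collapses.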
You should self-host if at least two of the following hold:

- You have at least two workloads that can share the GPUs, with a third on the roadmap.
- You SLA on p99 latency and need a queue you control.
- You can budget ~15 hours/month of engineering time to run the stack.

If most of these don't hold, stay on the API.
The right answer depends on numbers we can't predict for you. Run both for a quarter. The decision will be obvious from the data.