We ran the same workload on both for half a year. The break-even point isn't where most blog posts say it is — and the latency story has more nuance than throughput-per-dollar charts admit.
For the last six months we've run the same workload — internal-facing assistant for ~2,400 employees — against both OpenAI's API and a self-hosted Llama 3 70B Instruct deployment. Not as a science fair, but as production. Here are the numbers, and the calls we'd make differently next time.
- gpt-4o-mini for 90% of traffic
- gpt-4o for 10% (escalation queries)
- 4× g6e.12xlarge (4× L40S each = 16 GPUs total) on EKS
- Meta-Llama-3-70B-Instruct quantized to AWQ-INT4

| Model | Input ($/M tokens) | Output ($/M tokens) |
|---|---|---|
| gpt-4o-mini | $0.15 | $0.60 |
| gpt-4o (10% escalations) | $2.50 | $10.00 |
Effective blended rate at our mix: $0.42/M input + $1.74/M output.
Daily token volume: ~532M input, ~72M output.

Daily API cost: 532 × 0.42 + 72 × 1.74 ≈ $348/day → ~$10,400/mo
After 22% cache hit rate: ~$8,100/mo.
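The arithmetic above is easy to keep as a living script. A minimal sketch using the effective blended rates from our mix (all figures copied from the text, not fetched from any pricing API):

```python
# Cost model using the effective blended rates above ($/M tokens).
blended_in, blended_out = 0.42, 1.74
daily_in_mtok, daily_out_mtok = 532, 72       # millions of tokens per day

daily_cost = daily_in_mtok * blended_in + daily_out_mtok * blended_out  # 348.72
monthly = daily_cost * 30                      # ~$10.5k/mo
after_cache = monthly * (1 - 0.22)             # 22% cache hit rate -> ~$8.1k/mo
```

Re-running this whenever the traffic mix or the price sheet changes is how we keep the break-even comparison honest.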
4× g6e.12xlarge, reserved 1-year, no upfront: $2,815/mo each = $11,260/mo in compute.

Total self-hosted: ~$15,040/mo.
After 22% cache hit rate the savings on compute are zero because the GPUs are reserved 24/7. Cache hits just give us idle headroom for traffic spikes.
For us, yes: the API came out roughly $7k/mo cheaper ($8,100 vs $15,040). But the math flips quickly once GPU utilization rises. We now run a second internal workload (a PR review assistant) on the same cluster; effective GPU utilization went from 41% to 78%, and cost-per-request for the combined load is back below OpenAI's.
Throughput-per-dollar is the headline number every comparison post focuses on. The latency picture is messier and arguably more important for user experience.
Time to first token (TTFT):

| Stack | p50 | p95 | p99 |
|---|---|---|---|
| OpenAI gpt-4o-mini | 380ms | 720ms | 1.4s |
| Self-hosted (idle batch) | 95ms | 180ms | 290ms |
| Self-hosted (full batch) | 240ms | 480ms | 870ms |
Self-hosted was ~3× faster on TTFT because we control the network hop and the batch scheduling. For chat UX, TTFT is what users feel as "is it working?"
Generation speed (tokens/second):

| Stack | p50 | p95 |
|---|---|---|
| OpenAI gpt-4o-mini | 92 t/s | 41 t/s |
| Self-hosted (idle batch) | 78 t/s | 52 t/s |
| Self-hosted (full batch) | 38 t/s | 19 t/s |
OpenAI was generally faster per token but with more variance. Self-hosted under load was slower per token but predictable.
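Both tables come from raw per-request samples. A minimal stdlib sketch of the percentile computation (the sampling and export pipeline itself is out of scope here):

```python
import statistics

def percentiles(samples_ms):
    """p50/p95/p99 from raw latency samples (e.g. TTFT in milliseconds)."""
    # quantiles(n=100) returns 99 cut points; index i holds the (i+1)th percentile.
    qs = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}
```

The `method="inclusive"` choice treats the samples as the whole population rather than a sample of one, which matters less as the sample count grows.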
This is where self-hosted really shone. p99 on OpenAI varied across the day; we'd see occasional 8–15 second responses that we couldn't explain or escalate. Self-hosted p99 was 870ms and rock-stable because the queue was ours.
For a customer-facing product where you SLA on p99, this matters a lot.
The $3,500/mo "engineer time" line above is not a guess; we tracked it. Six months in, it averages roughly 15 hours/month at $230/hr fully loaded, about $3,450/mo.
We were briefly considering self-hosting for a single workload that was projected to be our main use case. It wasn't. It became one of three. If we'd committed GPU capacity to a single workload we'd have wasted most of it.
Rule we apply now: don't reserve GPUs unless you have at least two workloads that can share them, and a third one in your roadmap.
Months 1–3 we burned ~$8k/mo on the API. That bill funded the eval framework and gave us real production traffic data. With that data the GPU sizing decision was straightforward.
If we'd jumped to self-hosting in month 1 we'd have built for the wrong shape of workload.
Our internal eval set caught two regressions on the self-hosted side that would have shipped to users otherwise (prompt caching bug and a tokenizer mismatch). We run the eval against both stacks every deploy.
```python
# Sketch of the eval harness; grade() and summarize() are project helpers.
def run_eval(stack, suite):
    results = []
    for prompt in suite:
        completion = stack.complete(prompt.input)
        score = grade(prompt.expected, completion, llm_judge="gpt-4o")
        results.append({"prompt_id": prompt.id, "score": score})
    return summarize(results)
```
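The sketch leans on project helpers. A self-contained toy version, with a stubbed grader and a fake stack standing in for the OpenAI and vLLM clients, shows the shape of the per-deploy check against both stacks (every name here is a stand-in, not our production code):

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Prompt:
    id: str
    input: str
    expected: str

class FakeStack:
    """Stand-in for an OpenAI or vLLM client; complete() returns a string."""
    def __init__(self, reply):
        self.reply = reply
    def complete(self, text):
        return self.reply

def grade(expected, completion):
    # Toy exact-match grader; the real one calls an LLM judge.
    return 1.0 if expected == completion else 0.0

def summarize(results):
    return {"n": len(results), "mean_score": mean(r["score"] for r in results)}

def run_eval(stack, suite):
    results = [{"prompt_id": p.id,
                "score": grade(p.expected, stack.complete(p.input))}
               for p in suite]
    return summarize(results)

# Per-deploy check: same suite against both stacks, then compare summaries.
suite = [Prompt("p1", "ping", "pong"), Prompt("p2", "hi", "yo")]
api_summary = run_eval(FakeStack("pong"), suite)
local_summary = run_eval(FakeStack("yo"), suite)
```

A regression on either stack shows up as a drop in its `mean_score` relative to the last deploy, which is exactly how the two self-hosted regressions were caught.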
Semantic cache hit rate of 22% pays for itself either way. It's a no-brainer.
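For context, a semantic cache is a nearest-neighbor lookup over query embeddings: if a new query is close enough to a cached one, return the cached completion and skip the model call. A toy in-memory sketch, where a bag-of-words cosine stands in for a real embedding model and the 0.9 threshold is illustrative, not our production value:

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words vector; production would use a real embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.9):
        self.entries = []            # (embedding, completion) pairs
        self.threshold = threshold

    def get(self, query):
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]           # cache hit: skip the model call entirely
        return None                  # cache miss: caller goes to the model

    def put(self, query, completion):
        self.entries.append((embed(query), completion))
```

The threshold is the whole game: too loose and you serve stale or wrong answers, too tight and the hit rate collapses.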
You should self-host if at least two of the following hold:

- You have at least two workloads that can share the GPUs, with a third on the roadmap.
- You SLA on p99 latency and need a queue you control.
- You can budget ~15 hours/month of engineering time to run the stack.

If most of these don't hold, stay on the API.
The right answer depends on numbers we can't predict for you. Run both for a quarter. The decision will be obvious from the data.