Eighteen months ago we adopted the OpenTelemetry Collector as the single egress point for traces, metrics, and logs out of every Kubernetes cluster. Below are the config patterns that survived, the ones we tried and ripped out, and a few processors that quietly cut our observability bill by 38%.
```
┌──────────────────────────────────────────────────┐
│  Per-pod app emits OTLP to                       │
│  ┌────────────────────────────────────────────┐  │
│  │ Per-node Agent (DaemonSet)                 │  │
│  │  - receives OTLP/gRPC                      │  │
│  │  - tail-sampling, batching, redaction      │  │
│  └────────────────────────┬───────────────────┘  │
│                           │ OTLP/gRPC            │
│  ┌────────────────────────▼───────────────────┐  │
│  │ Cluster Gateway (Deployment, 3 replicas)   │  │
│  │  - aggregation, routing                    │  │
│  │  - sends to vendor + S3 cold storage       │  │
│  └────────────────────────┬───────────────────┘  │
└───────────────────────────┼──────────────────────┘
                            ▼
                  Vendor APM + S3 Parquet
```
We use both layers because of one realization: most of the value (sampling, redaction, attribute scrubbing) belongs on the agent, but fan-out to multiple backends belongs on the gateway. Conflating them led to slow, fragile configs.
```yaml
# /etc/otelcol-agent/config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        max_recv_msg_size_mib: 16
      http:
        endpoint: 0.0.0.0:4318
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu: {}
      memory: {}
      disk: {}
      filesystem:
        exclude_mount_points:
          mount_points: ["/var/lib/docker/.*", "/var/lib/kubelet/.*"]
          match_type: regexp

processors:
  # 1. Memory limiter MUST be first.
  memory_limiter:
    check_interval: 2s
    limit_percentage: 80
    spike_limit_percentage: 25

  # 2. Drop noisy spans before they cost us anything.
  filter/drop_health_checks:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.target"] == "/healthz"'
        - 'attributes["http.target"] == "/metrics"'
        - 'name == "GET /favicon.ico"'

  # 3. Redact known PII attributes.
  attributes/redact:
    actions:
      - { key: http.request.header.authorization, action: delete }
      - { key: http.request.header.cookie, action: delete }
      - { key: db.statement, action: hash }
      - { key: user.email, action: hash }

  # 4. Tail sampling: keep all errors + all slow + 5% of the rest.
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 1500
    policies:
      - { name: errors, type: status_code, status_code: { status_codes: [ERROR] } }
      - { name: slow, type: latency, latency: { threshold_ms: 500 } }
      - { name: random_5_pct, type: probabilistic, probabilistic: { sampling_percentage: 5 } }

  # 5. Batch right before exporting to amortize network overhead.
  batch:
    send_batch_size: 8192
    timeout: 5s

exporters:
  otlp/gateway:
    endpoint: otelcol-gateway.observability.svc:4317
    tls: { insecure: true }
    sending_queue:
      enabled: true
      num_consumers: 4
      queue_size: 5000
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 5m

service:
  telemetry:
    metrics:
      address: 0.0.0.0:8888
    logs:
      level: info
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, filter/drop_health_checks, attributes/redact, tail_sampling, batch]
      exporters: [otlp/gateway]
    metrics:
      receivers: [otlp, hostmetrics]
      processors: [memory_limiter, batch]
      exporters: [otlp/gateway]
```
### memory_limiter — The One You Must Have First

We learned the hard way: without `memory_limiter` first in the pipeline, a traffic spike will OOM-kill the agent and you'll lose data during the most interesting part of the incident. 80% with a 25% spike buffer has been stable for us across collectors with anywhere from 256 MiB to 4 GiB of memory.
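The percentages are relative to the memory the Collector believes it has, so on Kubernetes they only behave predictably when the container carries an explicit limit. A hypothetical DaemonSet fragment (the names and sizes here are ours to pick, not prescribed by the Collector):

```yaml
containers:
  - name: otelcol-agent
    image: otel/opentelemetry-collector-contrib:latest
    resources:
      requests:
        memory: 256Mi
      limits:
        # With a 512 MiB limit: limit_percentage 80 -> hard limit ~410 MiB;
        # spike_limit_percentage 25 -> ~128 MiB subtracted, so the soft
        # limit (where the Collector starts refusing data) is ~282 MiB.
        memory: 512Mi
```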
### filter/drop_health_checks — Cut Span Volume by 40%

Health checks alone were 40% of our span volume. They arrived with a sample rate of 1 and never errored, so every one of them still had to be buffered and evaluated by the tail sampler before anything could drop it. Dropping them at the agent saved us from paying the gateway, the wire, and the vendor.
If you only adopt one processor from this post: drop the noise at the edge.
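If your health endpoints vary by service, a single regex condition can replace the exact-match list. A sketch using OTTL's `IsMatch` function; the paths here are examples, adjust for your fleet:

```yaml
processors:
  filter/drop_health_checks:
    error_mode: ignore
    traces:
      span:
        # One regex instead of one exact-match condition per endpoint.
        - 'IsMatch(attributes["http.target"], "^/(healthz|livez|readyz|metrics)$")'
```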
### attributes/redact — Compliance and the Bill, Both

Hashing `db.statement` instead of dropping it kept query-shape grouping working in the vendor UI without sending raw SQL (which sometimes contained user data). Hashing also cut the average attribute size from 380 bytes to 64.
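If you want to go stricter than a block-list, the contrib `redaction` processor inverts the model: it drops every attribute you haven't explicitly permitted. A sketch, assuming a recent contrib build; the key list and card-number regex are illustrative, not our production set:

```yaml
processors:
  redaction/allowlist:
    # Drop any span attribute not named below.
    allow_all_keys: false
    allowed_keys:
      - http.method
      - http.target
      - http.status_code
      - db.system
    # Mask values matching these regexes even on allowed keys.
    blocked_values:
      - "4[0-9]{12}(?:[0-9]{3})?"
    summary: debug
```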
### tail_sampling — The Tradeoff to Understand

Tail sampling has to buffer every span of a trace until `decision_wait` elapses. `num_traces: 100000` means up to 100k traces in memory at once. We've seen the agent at peak hold ~2.5 GiB just for the buffer.
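The buffer is easy to estimate up front. The spans-per-trace and bytes-per-span figures below are assumptions chosen to match what we observed, not universal constants:

```text
num_traces     × avg spans/trace × avg span size ≈ buffer
100,000 traces × ~25 spans       × ~1 KiB        ≈ 2.4 GiB
```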
The win: from 100% retention, we now keep about 18% of traces (errors + slow + the 5% baseline) and have missed essentially none of the interesting incidents. Vendor cost dropped accordingly.
### batch — Boring But Required

Without batching, gRPC overhead per span was eating ~25% of CPU. With `send_batch_size: 8192`, CPU dropped to 6%.
### groupbyattrs Before tail_sampling

We thought routing by service name to different sampling policies would help. It made the config harder to reason about for marginal gains. We deleted it; we filter by service in the policy itself now.
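Filtering by service inside the policy uses tail_sampling's `and` policy to combine a `string_attribute` match with a probabilistic rate. A sketch; the service names and percentage are illustrative:

```yaml
processors:
  tail_sampling:
    policies:
      # Keep only 1% of traces from the two chattiest services.
      - name: chatty_services_1_pct
        type: and
        and:
          and_sub_policy:
            - name: match_service
              type: string_attribute
              string_attribute:
                key: service.name
                values: [cart-api, session-service]
            - name: rate
              type: probabilistic
              probabilistic: { sampling_percentage: 1 }
```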
A lua processor that built derived metrics from spans seemed clever. It added 18ms of latency per span and caused two production incidents when the lua VM hung. We replaced it with a transform-processor expression and never looked back.
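The replacement is a plain OTTL statement in the transform processor. A sketch; the derived attribute below is hypothetical, since we haven't described exactly what the lua code computed:

```yaml
processors:
  transform/derive:
    trace_statements:
      - context: span
        statements:
          # Tag API spans with a hypothetical route-group attribute,
          # done declaratively instead of inside an embedded lua VM.
          - set(attributes["http.route_group"], "api") where IsMatch(attributes["http.target"], "^/api/")
```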
We tried the filelog receiver and shipping logs as OTLP to the vendor. The schema mismatch with our existing log pipeline was painful, and the vendor charged 3x more for log ingest via OTLP. Logs still go through the legacy Fluent Bit → Loki path; only traces and metrics flow through OTel.
```yaml
processors:
  memory_limiter: { ... }
  batch: { send_batch_size: 16384, timeout: 10s }

  # Add cluster + region attributes that agents don't know.
  resource/cluster:
    attributes:
      - { key: k8s.cluster.name, value: prod-us-east-1, action: insert }
      - { key: deployment.environment, value: prod, action: insert }

exporters:
  otlphttp/vendor:
    endpoint: https://otlp.vendor.example
    headers: { api-key: ${env:VENDOR_API_KEY} }
    sending_queue: { num_consumers: 8, queue_size: 10000 }
  awss3/cold:
    s3uploader:
      region: us-east-1
      s3_bucket: telemetry-cold
      s3_prefix: traces
      file_format: parquet
      compression: zstd

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource/cluster, batch]
      exporters: [otlphttp/vendor, awss3/cold]
```
The awss3/cold exporter is a recent addition. It writes the same trace data to S3 in Parquet format. We use it for the queries the vendor charges per scan, and as our long-term retention beyond the vendor's 30-day window.
| Metric | Year-1 (no OTel) | Year-2 (OTel + sampling) |
|---|---|---|
| Spans/sec leaving cluster | 12.4k | 2.1k |
| Vendor APM bill | $24k/mo | $15k/mo |
| Trace retention (vendor) | 7 days | 30 days |
| Cold archive (S3) | none | 18 mo |
| Mean time to find a slow trace | 4 min | 90 sec |
That's a 38% reduction in vendor bill while extending retention from 7 to 30 days and adding 18 months of cold archive — all from being deliberate about what we send.
- `memory_limiter` first, `batch` last. Every pipeline. No exceptions.
- `db.statement` hashed is still useful; deleted is gone.
- Of the Collector's self-telemetry, `otelcol_processor_dropped_spans` is the one metric that catches everything.

If you have a single cluster, low cardinality, and one backend, run one Collector deployment and call it done. The agent + gateway pattern is for scale and multi-tenancy. Don't add it preemptively.
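For that single-cluster case, the whole thing fits in one config. A minimal sketch; the backend endpoint and API-key variable are placeholders:

```yaml
receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }

processors:
  memory_limiter: { check_interval: 2s, limit_percentage: 80, spike_limit_percentage: 25 }
  batch: { send_batch_size: 8192, timeout: 5s }

exporters:
  otlphttp/backend:
    endpoint: https://otlp.your-backend.example   # placeholder
    headers: { api-key: ${env:BACKEND_API_KEY} }

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/backend]
```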