Eighteen months ago we adopted the OpenTelemetry Collector as the single egress point for traces, metrics, and logs out of every Kubernetes cluster. Below are the config patterns that survived, the ones we tried and ripped out, and a few processors that quietly cut our observability bill by 38%.
```
┌──────────────────────────────────────────────────┐
│  Per-pod app emits OTLP to                       │
│  ┌────────────────────────────────────────────┐  │
│  │ Per-node Agent (DaemonSet)                 │  │
│  │  - receives OTLP/gRPC                      │  │
│  │  - tail-sampling, batching, redaction      │  │
│  └────────────────────────┬───────────────────┘  │
│                           │ OTLP/gRPC            │
│  ┌────────────────────────▼───────────────────┐  │
│  │ Cluster Gateway (Deployment, 3 replicas)   │  │
│  │  - aggregation, routing                    │  │
│  │  - sends to vendor + S3 cold storage       │  │
│  └────────────────────────┬───────────────────┘  │
└───────────────────────────┼──────────────────────┘
                            ▼
                  Vendor APM + S3 Parquet
```
We use both layers because of one realization: most of the value (sampling, redaction, attribute scrubbing) belongs on the agent, but fan-out to multiple backends belongs on the gateway. Conflating them led to slow, fragile configs.
```yaml
# /etc/otelcol-agent/config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        max_recv_msg_size_mib: 16
      http:
        endpoint: 0.0.0.0:4318
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu: {}
      memory: {}
      disk: {}
      filesystem:
        exclude_mount_points:
          mount_points: ["/var/lib/docker/.*", "/var/lib/kubelet/.*"]
          match_type: regexp

processors:
  # 1. Memory limiter MUST be first.
  memory_limiter:
    check_interval: 2s
    limit_percentage: 80
    spike_limit_percentage: 25

  # 2. Drop noisy spans before they cost us anything.
  filter/drop_health_checks:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.target"] == "/healthz"'
        - 'attributes["http.target"] == "/metrics"'
        - 'name == "GET /favicon.ico"'

  # 3. Redact known PII attributes.
  attributes/redact:
    actions:
      - { key: http.request.header.authorization, action: delete }
      - { key: http.request.header.cookie, action: delete }
      - { key: db.statement, action: hash }
      - { key: user.email, action: hash }

  # 4. Tail sampling: keep all errors + all slow + 5% of the rest.
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 1500
    policies:
      - { name: errors, type: status_code, status_code: { status_codes: [ERROR] } }
      - { name: slow, type: latency, latency: { threshold_ms: 500 } }
      - { name: random_5_pct, type: probabilistic, probabilistic: { sampling_percentage: 5 } }

  # 5. Batch right before exporting to amortize network overhead.
  batch:
    send_batch_size: 8192
    timeout: 5s

exporters:
  otlp/gateway:
    endpoint: otelcol-gateway.observability.svc:4317
    tls: { insecure: true }
    sending_queue:
      enabled: true
      num_consumers: 4
      queue_size: 5000
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 5m

service:
  telemetry:
    metrics:
      address: 0.0.0.0:8888
    logs:
      level: info
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, filter/drop_health_checks, attributes/redact, tail_sampling, batch]
      exporters: [otlp/gateway]
    metrics:
      receivers: [otlp, hostmetrics]
      processors: [memory_limiter, batch]
      exporters: [otlp/gateway]
```
### memory_limiter — The One You Must Have First

We learned the hard way: without `memory_limiter` first in the pipeline, a traffic spike will OOM-kill the agent and you'll lose data during the most interesting part of the incident. 80% with a 25% spike buffer has been stable for us across collectors with anywhere from 256 MiB to 4 GiB of memory.
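The percentages are relative to the memory the Collector believes it has, so on Kubernetes they only behave predictably when the container carries an explicit limit. A hypothetical DaemonSet fragment (the names and sizes here are ours to pick, not prescribed by the Collector):

```yaml
containers:
  - name: otelcol-agent
    image: otel/opentelemetry-collector-contrib:latest
    resources:
      requests:
        memory: 256Mi
      limits:
        # With a 512 MiB limit: limit_percentage 80 -> hard limit ~410 MiB;
        # spike_limit_percentage 25 -> ~128 MiB subtracted, so the soft
        # limit (where the Collector starts refusing data) is ~282 MiB.
        memory: 512Mi
```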
### filter/drop_health_checks — Cut Span Volume by 40%

Health checks alone were 40% of our span volume. They arrived with a sample rate of 1 and never errored, so every one of them still had to be buffered and evaluated by the tail sampler before anything could drop it. Dropping them at the agent saved us from paying the gateway, the wire, and the vendor.
If you only adopt one processor from this post: drop the noise at the edge.
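If your health endpoints vary by service, a single regex condition can replace the exact-match list. A sketch using OTTL's `IsMatch` function; the paths here are examples, adjust for your fleet:

```yaml
processors:
  filter/drop_health_checks:
    error_mode: ignore
    traces:
      span:
        # One regex instead of one exact-match condition per endpoint.
        - 'IsMatch(attributes["http.target"], "^/(healthz|livez|readyz|metrics)$")'
```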
### attributes/redact — Compliance and the Bill, Both

Hashing `db.statement` instead of dropping it kept query-shape grouping working in the vendor UI without sending raw SQL (which sometimes contained user data). Hashing also cut the average attribute size from 380 bytes to 64.
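If you want to go stricter than a block-list, the contrib `redaction` processor inverts the model: it drops every attribute you haven't explicitly permitted. A sketch, assuming a recent contrib build; the key list and card-number regex are illustrative, not our production set:

```yaml
processors:
  redaction/allowlist:
    # Drop any span attribute not named below.
    allow_all_keys: false
    allowed_keys:
      - http.method
      - http.target
      - http.status_code
      - db.system
    # Mask values matching these regexes even on allowed keys.
    blocked_values:
      - "4[0-9]{12}(?:[0-9]{3})?"
    summary: debug
```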
### tail_sampling — The Tradeoff to Understand

Tail sampling has to buffer every span of a trace until `decision_wait` elapses. `num_traces: 100000` means up to 100k traces in memory at once. We've seen the agent at peak hold ~2.5 GiB just for the buffer.
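The buffer is easy to estimate up front. The spans-per-trace and bytes-per-span figures below are assumptions chosen to match what we observed, not universal constants:

```text
num_traces     × avg spans/trace × avg span size ≈ buffer
100,000 traces × ~25 spans       × ~1 KiB        ≈ 2.4 GiB
```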
The win: from 100% retention, we now keep about 18% of traces (errors + slow + the 5% baseline) and have missed essentially none of the interesting incidents. Vendor cost dropped accordingly.
### batch — Boring But Required

Without batching, gRPC overhead per span was eating ~25% of CPU. With `send_batch_size: 8192`, CPU dropped to 6%.
### groupbyattrs Before tail_sampling

We thought routing by service name to different sampling policies would help. It made the config harder to reason about for marginal gains. We deleted it; we filter by service in the policy itself now.
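Filtering by service inside the policy uses tail_sampling's `and` policy to combine a `string_attribute` match with a probabilistic rate. A sketch; the service names and percentage are illustrative:

```yaml
processors:
  tail_sampling:
    policies:
      # Keep only 1% of traces from the two chattiest services.
      - name: chatty_services_1_pct
        type: and
        and:
          and_sub_policy:
            - name: match_service
              type: string_attribute
              string_attribute:
                key: service.name
                values: [cart-api, session-service]
            - name: rate
              type: probabilistic
              probabilistic: { sampling_percentage: 1 }
```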
A lua processor that built derived metrics from spans seemed clever. It added 18ms of latency per span and caused two production incidents when the lua VM hung. We replaced it with a transform-processor expression and never looked back.
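The replacement is a plain OTTL statement in the transform processor. A sketch; the derived attribute below is hypothetical, since we haven't described exactly what the lua code computed:

```yaml
processors:
  transform/derive:
    trace_statements:
      - context: span
        statements:
          # Tag API spans with a hypothetical route-group attribute,
          # done declaratively instead of inside an embedded lua VM.
          - set(attributes["http.route_group"], "api") where IsMatch(attributes["http.target"], "^/api/")
```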
We tried the filelog receiver and shipping logs as OTLP to the vendor. The schema mismatch with our existing log pipeline was painful, and the vendor charged 3x more for log ingest via OTLP. Logs still go through the legacy Fluent Bit → Loki path; only traces and metrics flow through OTel.
```yaml
processors:
  memory_limiter: { ... }
  batch: { send_batch_size: 16384, timeout: 10s }

  # Add cluster + region attributes that agents don't know.
  resource/cluster:
    attributes:
      - { key: k8s.cluster.name, value: prod-us-east-1, action: insert }
      - { key: deployment.environment, value: prod, action: insert }

exporters:
  otlphttp/vendor:
    endpoint: https://otlp.vendor.example
    headers: { api-key: ${env:VENDOR_API_KEY} }
    sending_queue: { num_consumers: 8, queue_size: 10000 }
  awss3/cold:
    s3uploader:
      region: us-east-1
      s3_bucket: telemetry-cold
      s3_prefix: traces
      file_format: parquet
      compression: zstd

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource/cluster, batch]
      exporters: [otlphttp/vendor, awss3/cold]
```
The awss3/cold exporter is a recent addition. It writes the same trace data to S3 in Parquet format. We use it for the queries the vendor charges per scan, and as our long-term retention beyond the vendor's 30-day window.
| Metric | Year-1 (no OTel) | Year-2 (OTel + sampling) |
|---|---|---|
| Spans/sec leaving cluster | 12.4k | 2.1k |
| Vendor APM bill | $24k/mo | $15k/mo |
| Trace retention (vendor) | 7 days | 30 days |
| Cold archive (S3) | none | 18 mo |
| Mean time to find a slow trace | 4 min | 90 sec |
That's a 38% reduction in vendor bill while extending retention from 7 to 30 days and adding 18 months of cold archive — all from being deliberate about what we send.
- `memory_limiter` first, `batch` last. Every pipeline. No exceptions.
- `db.statement` hashed is still useful; deleted is gone.
- Of the Collector's self-telemetry, `otelcol_processor_dropped_spans` is the one metric that catches everything.

If you have a single cluster, low cardinality, and one backend, run one Collector deployment and call it done. The agent + gateway pattern is for scale and multi-tenancy. Don't add it preemptively.
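For that single-cluster case, the whole thing fits in one config. A minimal sketch; the backend endpoint and API-key variable are placeholders:

```yaml
receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }

processors:
  memory_limiter: { check_interval: 2s, limit_percentage: 80, spike_limit_percentage: 25 }
  batch: { send_batch_size: 8192, timeout: 5s }

exporters:
  otlphttp/backend:
    endpoint: https://otlp.your-backend.example   # placeholder
    headers: { api-key: ${env:BACKEND_API_KEY} }

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/backend]
```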