
Building an OTEL-native observability pipeline

Priya Nakashima February 12, 2025


OpenTelemetry has won. The vendor debates are mostly over — OTEL is the instrumentation standard that the industry is converging on, backed by every major observability vendor and cloud provider. But "OTEL is the standard" and "we have a working OTEL pipeline" are very different statements. The gap between them is where most teams spend six months they didn't plan to spend.

This is what we learned building our own OTEL-native pipeline and helping early adopters of Devloom instrument their services. Not the getting-started tutorial — you can find that on the OpenTelemetry docs site. The parts that bite you once you're past hello-world.

Understanding the Collector's role before you architect anything

The OpenTelemetry Collector is the piece most teams underestimate. It's not a sidecar you deploy and forget. The Collector is a pipeline in itself: receivers ingest signals from your services, processors transform them (filtering, sampling, attribute enrichment, batching), and exporters ship them to your backend.

The architectural decision that matters most is whether you run the Collector as an agent (one per node), a gateway (centralized), or both. For Kubernetes environments with 50+ services, the answer is almost always both: agents handle per-node collection and initial filtering, a gateway cluster handles batching and routing to multiple backends. If you only run agents, you end up with N×backends fan-out that's hard to manage. If you only run a gateway, you lose the ability to do per-node sampling decisions.

The Collector config is YAML, and it will grow. Start with explicit naming conventions for your pipelines (traces/production, metrics/infra, logs/application) and structure the config in separate files that get merged. Don't put everything in one 400-line YAML that nobody understands six months later.
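As a rough illustration of the split-and-merge layout, the sketch below models three config fragments as Python dictionaries (as they might look after loading three YAML files) and merges them into one config. The fragment contents, the `otlp/backend` exporter name, and the endpoint are hypothetical, and `merge_config` is a simplified model, not the Collector's own merge logic, which may treat lists and conflicts differently.

```python
from copy import deepcopy


def merge_config(base: dict, overlay: dict) -> dict:
    """Recursively merge overlay into a copy of base; overlay wins on conflicts."""
    merged = deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical fragments, one per file, each owning one concern.
receivers_fragment = {
    "receivers": {"otlp": {"protocols": {"grpc": {}, "http": {}}}},
}
traces_fragment = {
    "processors": {"batch": {}},
    "exporters": {"otlp/backend": {"endpoint": "backend.example.internal:4317"}},
    "service": {"pipelines": {"traces/production": {
        "receivers": ["otlp"],
        "processors": ["batch"],
        "exporters": ["otlp/backend"],
    }}},
}
metrics_fragment = {
    "service": {"pipelines": {"metrics/infra": {
        "receivers": ["otlp"],
        "processors": ["batch"],
        "exporters": ["otlp/backend"],
    }}},
}

config = {}
for fragment in (receivers_fragment, traces_fragment, metrics_fragment):
    config = merge_config(config, fragment)

# Each pipeline keeps its explicit name after the merge.
print(sorted(config["service"]["pipelines"]))
```

The point of the layout is that each file can be reviewed on its own, while the named pipelines make it obvious which fragment owns which signal path.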

Instrumentation strategy: auto vs manual, and when to mix

OpenTelemetry provides auto-instrumentation for most runtimes: Java, Python, Node.js, and .NET through agents that attach without code changes, and Go through contrib instrumentation packages that you wire in yourself. Auto-instrumentation will get you spans for HTTP requests, database queries, and common framework operations with little or no application code. This is genuinely useful and the right place to start.

The limitation is that auto-instrumentation gives you the skeleton, not the flesh. It tells you "an HTTP call was made to /api/orders" with a duration. It doesn't tell you "this call processed 47 line items for order ID 8821-B." That business context — the attributes that make a trace useful for debugging rather than just for counting — requires manual instrumentation at the application level.

The pattern that works: deploy auto-instrumentation first to establish the trace skeleton across all services. Then identify the 20% of spans that appear in 80% of your incidents — typically the spans that represent your core business transactions — and add manual attribute enrichment to those. You don't need to manually instrument everything. You need to manually instrument the paths that matter when things go wrong.

One specific thing to always add manually: the correlation IDs that cross service boundaries. If you have an order ID, a session ID, a user ID that flows through multiple services, add those as span attributes. Without them, you can correlate traces within a single service but not across the dependency graph — which is exactly where you need it most.

Sampling: the decision you can't defer

At scale, tracing everything is not economically viable. A service handling 10,000 requests per minute generates spans that multiply across your service graph, because every downstream hop adds its own spans to each trace. If you have 20 services and keep all traces, your trace storage costs scale with your traffic, not with your incident rate.
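A back-of-the-envelope calculation makes the scale concrete. Only the 10,000 requests per minute and the 20 services come from the text; the assumption that every request touches all 20 services, the three spans per service, and the 500 bytes per stored span are illustrative guesses, so substitute your own measurements.

```python
REQUESTS_PER_MINUTE = 10_000  # from the text
SERVICES_PER_TRACE = 20       # assumption: every request touches all 20 services
SPANS_PER_SERVICE = 3         # assumption: e.g. one server, one client, one DB span
BYTES_PER_SPAN = 500          # assumption: rough stored size per span


def spans_per_day(rpm: int, services: int, spans_per_service: int) -> int:
    """Spans produced in 24 hours if every trace is kept."""
    traces_per_day = rpm * 60 * 24
    return traces_per_day * services * spans_per_service


total_spans = spans_per_day(REQUESTS_PER_MINUTE, SERVICES_PER_TRACE, SPANS_PER_SERVICE)
total_gb = total_spans * BYTES_PER_SPAN / 1e9

print(f"{total_spans:,} spans/day, about {total_gb:.0f} GB/day before sampling")
```

Under these assumptions the pipeline stores 864 million spans, roughly 432 GB, every day, regardless of whether anything went wrong.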

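The sampling strategies available are head-based sampling (the decision is made when the trace starts, before its outcome is known) and tail-based sampling (the decision is made after all spans for a trace are collected). Probabilistic sampling (keep N% of all traces) is the most common form of head-based sampling, although a probabilistic rule can also be applied at the tail. Each has trade-offs that matter in practice.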

Head-based sampling is simple but wrong for many use cases. If you keep 10% of traces at the head, you're dropping 90% of traces before knowing whether they're interesting. Errors, high-latency requests, and anomalous behavior will be sampled at the same rate as normal requests, so nine out of ten of the rare traces you actually need are never stored. You'll have a corpus that represents normal traffic well and contains only a thin slice of the failures you want to debug.
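The effect is easy to see in a small simulation. The sketch below makes a deterministic keep/drop decision from the trace ID alone, so every service reaches the same verdict for a given trace. It is an illustration in the spirit of ratio-based samplers, not the SDK's implementation, and the 1% error rate is an assumption.

```python
import random


def keep_trace(trace_id: int, rate: float) -> bool:
    """Head-based decision using only the low 64 bits of a 128-bit trace ID."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    threshold = int(rate * (1 << 64))
    return (trace_id & ((1 << 64) - 1)) < threshold


# Simulate 100,000 traces, of which roughly 1% are errors (assumed rate).
rng = random.Random(42)
traces = [(rng.getrandbits(128), rng.random() < 0.01) for _ in range(100_000)]

errors = [trace_id for trace_id, is_error in traces if is_error]
kept_errors = [trace_id for trace_id in errors if keep_trace(trace_id, 0.10)]

# The sampler never sees is_error, so errors are dropped as often as anything else.
print(f"error traces: {len(errors)}, kept at 10% sampling: {len(kept_errors)}")
```

Because the decision never looks at the outcome, roughly 90% of the error traces disappear along with the uninteresting ones.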

Tail-based sampling is correct but complex. The Collector's tail sampling processor buffers spans, waits for the full trace to arrive, then makes a sampling decision based on the complete trace (did it contain an error? did it exceed your latency threshold?). The complexity is that all spans for a trace must arrive at the same Collector instance, which requires routing by trace ID in front of the sampling tier (for example, consistent hashing in your load balancer) and careful buffer sizing, because every in-flight trace is held in memory until its decision is made. In practice, aim to keep 100% of error traces, 100% of traces exceeding your p99 SLO, and a small percentage (1-5%) of normal traces for baseline.
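The keep-errors, keep-slow, keep-a-baseline rule can be sketched as a decision function over a fully assembled trace. This shows only the decision logic, not the Collector processor's code; the `Span` shape, the 800 ms threshold, and the 2% baseline are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Span:
    name: str
    duration_ms: float
    is_error: bool = False
    is_root: bool = False


def tail_decision(
    spans: List[Span],
    latency_slo_ms: float,
    baseline_rate: float,
    rng: Callable[[], float] = random.random,
) -> Optional[str]:
    """Return the reason a complete trace is kept, or None if it is dropped."""
    if any(span.is_error for span in spans):
        return "error"      # keep 100% of error traces
    root = next((span for span in spans if span.is_root), None)
    if root is not None and root.duration_ms > latency_slo_ms:
        return "slow"       # keep 100% of traces over the latency threshold
    if rng() < baseline_rate:
        return "baseline"   # keep a small share of normal traffic
    return None


healthy = [Span("GET /api/orders", 120.0, is_root=True), Span("SELECT orders", 15.0)]
failed = [Span("GET /api/orders", 95.0, is_root=True), Span("charge card", 40.0, is_error=True)]
slow = [Span("GET /api/orders", 2300.0, is_root=True), Span("SELECT orders", 2100.0)]

for label, trace in (("healthy", healthy), ("failed", failed), ("slow", slow)):
    print(label, tail_decision(trace, latency_slo_ms=800.0, baseline_rate=0.02))
```

The function only gives a correct answer if it sees every span of the trace, which is why the routing and buffering constraints above are not optional.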

We're not saying head-based sampling is never right. If you genuinely can't afford the operational overhead of tail sampling and you have high-volume traffic, probabilistic head sampling with a generous rate (say, 20%) is a defensible choice. Just be aware that you're trading debugging fidelity for operational simplicity.

Context propagation across async boundaries

The most common place OTEL pipelines silently break is at async boundaries: message queues, job schedulers, event buses. When a request triggers a Kafka message that's consumed by a separate worker, the trace context needs to travel with that message — in the message headers — or you end up with two disconnected traces instead of one connected view of the work.

OpenTelemetry's W3C Trace Context propagation (the traceparent header) handles HTTP and gRPC automatically. For Kafka, you need to propagate the context explicitly by injecting it into the message headers at the producer and extracting it at the consumer, and describe the resulting spans with the messaging semconv attributes. The OTEL contrib instrumentation for Kafka handles this if you configure it correctly, but it's not on by default in all SDKs, and the Java agent particularly requires verifying that the messaging.kafka.consumer.group attribute is being set.
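To show what travels with the message, here is a minimal stdlib-only sketch of the round trip: the producer injects a traceparent header and the consumer extracts it. It models Kafka headers as a list of (key, bytes) pairs and handles only version 00 of the header. The `inject` and `extract` helpers are hypothetical names for this illustration; in a real service the OTEL SDK's propagators do this work.

```python
import re
import secrets

# W3C Trace Context, version 00: version-traceid-parentid-flags in lowercase hex.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(headers: list, trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Producer side: append a traceparent header to the outgoing message."""
    flags = "01" if sampled else "00"
    value = f"00-{trace_id}-{span_id}-{flags}"
    headers.append(("traceparent", value.encode("ascii")))


def extract(headers: list):
    """Consumer side: return the parent context, or None if absent or invalid."""
    for key, value in headers:
        if key != "traceparent":
            continue
        match = TRACEPARENT_RE.match(value.decode("ascii", errors="replace"))
        if not match:
            return None
        trace_id, parent_span_id, flags = match.groups()
        # All-zero IDs are invalid in the W3C format.
        if set(trace_id) == {"0"} or set(parent_span_id) == {"0"}:
            return None
        return {
            "trace_id": trace_id,
            "parent_span_id": parent_span_id,
            "sampled": bool(int(flags, 16) & 1),
        }
    return None


# Producer: the span that publishes the message.
producer_trace_id = secrets.token_hex(16)
producer_span_id = secrets.token_hex(8)
message_headers = [("content-type", b"application/json")]
inject(message_headers, producer_trace_id, producer_span_id)

# Consumer: continue the same trace instead of starting a new one.
parent = extract(message_headers)
consumer_span = {
    "trace_id": parent["trace_id"],
    "parent_span_id": parent["parent_span_id"],
    "span_id": secrets.token_hex(8),
    "name": "process order message",
}
print(consumer_span["trace_id"] == producer_trace_id)
```

If the header is missing or malformed, `extract` returns None and the consumer has no choice but to start a fresh trace, which is the disconnected-fragment failure described above.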

Async boundary propagation is the difference between traces that show you the full causal chain (HTTP request → queue message → worker → database) and traces that show you disconnected fragments. For incident investigation, the difference is material.

Connecting OTEL signals to deployment events

OTEL standardizes how you collect and ship observability signals. It doesn't standardize how you correlate those signals with deployment context. That correlation — knowing that the P99 latency increase at 14:32 UTC coincided with the deploy of order-service v2.4.1 — is what turns traces from interesting data into actionable root cause evidence.

The mechanism is resource attributes. Set service.version on your OTEL resource when the application starts (pull it from your CI/CD environment at build time, inject it as an environment variable). Do the same with deployment.environment and any custom attributes that identify the build — git commit SHA, deploy timestamp, deploy pipeline run ID. When you then query traces, you can filter by version and immediately see "all traces from order-service v2.4.1" without needing to cross-reference external systems manually.
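One way to wire this up is to build the resource attributes from variables your deploy pipeline injects, then hand them to the SDK through the standard OTEL_RESOURCE_ATTRIBUTES environment variable. In the sketch below, the input variable names (SERVICE_VERSION, DEPLOY_ENV, GIT_COMMIT_SHA, PIPELINE_RUN_ID) and the custom `build.*` attribute keys are hypothetical; only service.version and deployment.environment come from the text.

```python
from urllib.parse import quote


def build_resource_attributes(env: dict) -> dict:
    """Collect deploy metadata, refusing to start without a real version."""
    version = env.get("SERVICE_VERSION")  # hypothetical variable set by CI/CD
    if not version:
        # Fail loudly instead of silently reporting "unknown".
        raise RuntimeError("SERVICE_VERSION is not set; refusing to start untagged")
    attributes = {
        "service.version": version,
        "deployment.environment": env.get("DEPLOY_ENV", "development"),
    }
    # Optional custom attributes; the key names are illustrative, not standard.
    if env.get("GIT_COMMIT_SHA"):
        attributes["build.git_sha"] = env["GIT_COMMIT_SHA"]
    if env.get("PIPELINE_RUN_ID"):
        attributes["build.pipeline_run_id"] = env["PIPELINE_RUN_ID"]
    return attributes


def to_otel_resource_attributes(attributes: dict) -> str:
    """Format as comma-separated key=value pairs with percent-encoded values."""
    return ",".join(f"{key}={quote(str(value), safe='')}" for key, value in attributes.items())


# Example environment as a deploy pipeline might inject it.
deploy_env = {
    "SERVICE_VERSION": "2.4.1",
    "DEPLOY_ENV": "production",
    "GIT_COMMIT_SHA": "abc1234",
}
print(to_otel_resource_attributes(build_resource_attributes(deploy_env)))
```

In a real entrypoint you would pass `os.environ` instead of the example dictionary and export the resulting string before the SDK initializes. Raising on a missing version is a deliberate choice: a crash at deploy time is cheaper than discovering untagged traces during an incident.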

This is the connection point between OTEL's signal collection and Devloom's root cause analysis: once service.version is set consistently, we can automatically correlate metric and trace degradation with specific deploys without requiring engineers to do that lookup manually during an incident. The plumbing only works if the version attribute is set at service startup, not defaulted to "unknown" or omitted entirely, which is unfortunately the default in most auto-instrumentation setups.

A realistic migration timeline

If you're migrating from an existing instrumentation setup (Jaeger client libs, Zipkin, StatsD, custom metrics), plan for a parallel-run period. Run both old and new pipelines simultaneously, validate that the OTEL pipeline produces equivalent data, then cut over service by service rather than all at once.
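During the parallel run, a simple first check is to compare per-service span counts from both pipelines over the same time window. The sketch below flags services that drift beyond a tolerance. The service names, counts, and the 2% tolerance are made up for illustration, and matching counts are only a first signal: attributes and parent-child links need their own checks.

```python
def compare_span_counts(legacy: dict, otel: dict, tolerance: float = 0.02) -> dict:
    """Return services whose OTEL span count deviates from legacy beyond tolerance."""
    mismatches = {}
    for service in sorted(set(legacy) | set(otel)):
        old = legacy.get(service, 0)
        new = otel.get(service, 0)
        drift = abs(new - old) / max(old, 1)  # avoid division by zero
        if drift > tolerance:
            mismatches[service] = (old, new, round(drift, 3))
    return mismatches


# Hypothetical counts for the same one-hour window from each pipeline.
legacy_counts = {"order-service": 10_000, "payment-service": 5_000, "inventory-service": 800}
otel_counts = {"order-service": 9_950, "payment-service": 4_200}

for service, (old, new, drift) in compare_span_counts(legacy_counts, otel_counts).items():
    print(f"{service}: legacy={old} otel={new} drift={drift:.1%}")
```

A service that passes this check for several days in a row is a reasonable candidate for cutover; one that is missing entirely from the new pipeline, like inventory-service here, usually points to an instrumentation or export gap rather than a sampling difference.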

The OTEL Collector has receivers for most legacy formats (Zipkin, Jaeger Thrift, StatsD, Prometheus) precisely to support this. You can receive legacy format from existing services, transform in the Collector, and export to your new OTEL-native backend — giving you OTEL semantics in your backend before all your services are migrated.

Realistically, for a team running 50-80 services: 2-3 months to instrument the most critical services with auto-instrumentation + manual enrichment on the core business paths, another 2-3 months to reach full coverage, plus ongoing tuning of sampling rates and Collector resource allocation. It's not a weekend project. But the payoff — correlated traces, metrics, and logs with deployment context attached — is the difference between incident triage that takes hours and incident triage that takes minutes.