Loki log aggregation patterns for microservice architectures

Priya Nakashima October 7, 2025

When we started integrating Loki into Devloom's log ingestion layer, the first mistake we made was treating it like a slower, cheaper Elasticsearch. That framing gets you into trouble fast. Loki's architecture is deliberately alien to the full-text-index model, and the teams that get the most out of it are the ones who internalize the difference early rather than fighting it for six months.

This is a field guide based on what we've worked through running Loki against microservice architectures in the 50-100 service range. If you're running fewer services than that, most of it still applies. If you're much larger, some of the label-cardinality problems get hairier than we cover here.

The label model: where intuition breaks down

Loki indexes only labels, not log content. That's the foundational constraint everything else flows from. Labels are key-value pairs attached at ingestion time — namespace, pod, app, env are the ones you'll see in most Kubernetes setups. The log line itself is stored compressed and unindexed. When you run a LogQL query, Loki first narrows the search space using label selectors, then scans the matching log streams for your filter expression.

This means label selection directly determines query performance in a way that has no Elasticsearch analog. With Elasticsearch, adding more indexed fields generally makes queries faster. With Loki, adding more labels increases the number of distinct streams, which makes ingestion harder to manage and can actively slow down certain query patterns by fragmenting data that should be contiguous.

The mental model that works: labels are for stream selection, not data enrichment. Ask yourself "will I ever query this as a primary selector?" before making something a label. The answer for request_id is almost always no — that belongs in the log line. The answer for service_name is almost always yes.

Cardinality: the tax you pay for every label

High-cardinality labels are the most common way Loki installations go wrong. A label with thousands of distinct values creates thousands of distinct log streams. Loki has to track each stream separately in its index, which drives up memory pressure in the ingester and query latency in the querier.

The usual culprits in a microservice environment:

Pod names: in Kubernetes, pod names include a unique hash suffix. payment-service-7d9f6b-xkqr2 is a different label value from payment-service-7d9f6b-mj4ts, and therefore a different stream. If you're labeling on full pod name rather than the deployment name, you've created cardinality proportional to your replica count times your deployment frequency.
User IDs: any per-user or per-request identifier used as a label is a cardinality disaster. Keep these in the log line, ideally as a structured log field.
Trace IDs: same problem. Trace IDs belong in the log content for correlation via LogQL line filters, not as labels.
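
One way to keep pod churn out of the label set is to label on a stable name derived from the pod name. The sketch below assumes Deployment-managed pods named <deployment>-<replicaset-hash>-<pod-suffix>, as in the examples above. The function name and the suffix lengths in the pattern are illustrative assumptions, and where pods already carry an app label, that is usually a more reliable source than parsing names.

```python
import re

# Assumption: pods created by a Deployment are named
#   <deployment>-<replicaset-hash>-<pod-suffix>
# The suffix lengths here are guesses that fit the examples in the text.
_DEPLOYMENT_POD = re.compile(r"^(?P<deployment>.+)-[a-z0-9]{5,10}-[a-z0-9]{5}$")


def stable_app_label(pod_name: str) -> str:
    """Return a low-cardinality label value for a pod name.

    Names that do not look like Deployment-managed pods are returned unchanged.
    """
    match = _DEPLOYMENT_POD.match(pod_name)
    return match.group("deployment") if match else pod_name


# Both replicas collapse onto the same label value.
replicas = ["payment-service-7d9f6b-xkqr2", "payment-service-7d9f6b-mj4ts"]
distinct_values = {stable_app_label(name) for name in replicas}
```

With this mapping, the two replica names yield a single label value, so a rollout no longer mints new streams.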

The practical ceiling we've found for well-managed production setups is roughly 10,000-50,000 active streams. Beyond that, you start seeing ingester memory pressure and chunk flush contention that's hard to tune your way out of. The fix is almost always cardinality reduction at the label model level, not hardware.

Structuring labels for microservice topologies

For a 60-service architecture, we've settled on a label set that looks roughly like this:

namespace: "production"
app: "payment-api"
env: "prod"
cluster: "us-west-2-eks"

That's four labels per stream. It gives us the ability to query across a namespace, isolate to a single service, filter by environment, and separate by cluster. Four to six labels is a reasonable target for most setups. More than eight gets into territory where you should audit what you're actually using in queries.

One pattern that helps in multi-team environments: agree on label names before you have 40 services emitting logs. Choosing between service, app, and component might seem trivial until you're writing a query that needs to span logs from several teams and every team named the thing differently. We standardize on app to match Kubernetes pod label conventions, which lets us correlate log queries with metric queries on the same label set.

We're not saying every additional label is wrong. There are cases where a bounded-cardinality label, like region with five possible values, is worth adding even though it can multiply your stream count by up to five. The calculus is: bounded cardinality with clear query value justifies it. Unbounded cardinality almost never does.
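
The arithmetic behind that calculus is simple enough to sketch. The snippet below computes a worst-case stream count as the product of per-label cardinalities, which is an upper bound because correlated labels produce fewer combinations in practice. The function name and the example numbers (60 services, 10,000 user IDs) are illustrative assumptions, and the ceiling is the upper end of the 10,000-50,000 range mentioned earlier.

```python
from math import prod

# Upper end of the active-stream range quoted earlier in this guide.
STREAM_CEILING = 50_000


def worst_case_streams(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on streams: every combination of label values is distinct."""
    return prod(label_cardinalities.values())


# Hypothetical numbers for a 60-service setup using the four-label set.
base = {"namespace": 1, "app": 60, "env": 1, "cluster": 1}
with_region = {**base, "region": 5}          # bounded: five possible values
with_user_id = {**base, "user_id": 10_000}   # effectively unbounded

base_streams = worst_case_streams(base)
region_streams = worst_case_streams(with_region)
user_id_streams = worst_case_streams(with_user_id)
over_ceiling = user_id_streams > STREAM_CEILING
```

Under these assumptions the bounded region label stays far below the ceiling, while a user_id label exceeds it by an order of magnitude.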

LogQL patterns that actually perform

LogQL has two distinct modes: log queries that return log lines, and metric queries that aggregate log data into time series. The performance characteristics differ enough that it's worth thinking about them separately.

For log queries, the most important habit is label selector first, filter expression second. Always lead with a specific label selector to narrow the stream set before applying line filters:

{app="payment-api", env="prod"} |= "ERROR" | json | latency > 500

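Inverting this, with a broad selector such as {env="prod"} and line filters doing all the narrowing, forces Loki to fetch and scan chunks from every matching stream. There is no full-text index to fall back on. The label selector is your index. Keep it as specific as you can.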
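
If queries are generated by tooling rather than typed by hand, the selector-first habit can be enforced in code. This is a minimal sketch, not part of any Loki client library: build_log_query and its parameters are hypothetical names, and the escaping covers only backslashes and double quotes.

```python
from typing import Optional


def _quote(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_log_query(
    selector: dict[str, str],
    line_contains: Optional[str] = None,
    stages: Optional[list[str]] = None,
) -> str:
    """Compose a LogQL log query: label selector first, filters second."""
    if not selector:
        # Refuse to build a query that would scan every stream.
        raise ValueError("at least one label matcher is required")
    matchers = ", ".join(f"{name}={_quote(value)}" for name, value in selector.items())
    query = "{" + matchers + "}"
    if line_contains is not None:
        query += f" |= {_quote(line_contains)}"
    for stage in stages or []:
        query += f" | {stage}"
    return query


query = build_log_query(
    {"app": "payment-api", "env": "prod"},
    line_contains="ERROR",
    stages=["json", "latency > 500"],
)
```

Called this way, the builder reproduces the query shown above, and an empty selector raises an error instead of producing a broad scan.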

For metric queries over log data, rate() and count_over_time() are the workhorses. One pattern we use heavily in Devloom's log correlation layer: counting error-rate spikes per service using a sliding window, then comparing the spike timing against deployment events. That's where Loki really earns its place — when you need to ask "did the error rate in this service jump within 10 minutes of this config push?" and you want the answer in a single LogQL metric query rather than a cross-tool join.
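
As a rough illustration of the spike-versus-deploy comparison, the sketch below pairs a count_over_time query with a small check on its results. The query string, the function name, the 3x threshold, and the sample data are all assumptions made for the example; this is not how Devloom's correlation layer is implemented.

```python
from datetime import datetime, timedelta, timezone

# A LogQL metric query: error lines per 5-minute window, grouped by app.
ERROR_COUNT_QUERY = (
    'sum by (app) (count_over_time({app="payment-api", env="prod"} |= "ERROR" [5m]))'
)


def spiked_after_deploy(
    samples: list[tuple[datetime, float]],
    deploy_time: datetime,
    window: timedelta = timedelta(minutes=10),
    factor: float = 3.0,
) -> bool:
    """True if any sample within `window` after the deploy exceeds
    `factor` times the average of the samples before it."""
    before = [count for ts, count in samples if ts < deploy_time]
    after = [count for ts, count in samples if deploy_time <= ts <= deploy_time + window]
    if not before or not after:
        return False
    baseline = max(sum(before) / len(before), 1.0)  # avoid a zero baseline
    return max(after) > factor * baseline


deploy = datetime(2025, 10, 7, 12, 0, tzinfo=timezone.utc)
# Hypothetical (timestamp, error count) points as the query might return them.
samples = [
    (deploy - timedelta(minutes=10), 2.0),
    (deploy - timedelta(minutes=5), 4.0),
    (deploy, 3.0),
    (deploy + timedelta(minutes=5), 40.0),
    (deploy + timedelta(minutes=10), 38.0),
]
spike_detected = spiked_after_deploy(samples, deploy)
```

The threshold is deliberately crude. The point is that the time series comes from one metric query and the deploy timestamp is the only external input.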

Promtail versus the alternatives for label attachment

In Kubernetes, most teams use Promtail as the log shipper, which handles label discovery from pod annotations and Kubernetes metadata. Promtail's pipeline stages let you add, drop, or rewrite labels at the shipper level before logs reach Loki — which is where you can intercept cardinality problems before they become ingestion problems.

The labelallow stage is underused. It lets you explicitly declare which labels Promtail is allowed to attach, acting as a whitelist. If you're picking up Kubernetes metadata automatically and concerned about accidentally inheriting high-cardinality labels from pod annotations, a labelallow stage with your approved four to six labels is a good defensive pattern.
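
The effect of an allowlist is easy to model. The sketch below only illustrates the semantics: it is not Promtail's implementation, and the function name, the approved label set, and the sample labels are assumptions for the example.

```python
# Assumption: the approved label set from the four-label example earlier.
ALLOWED_LABELS = frozenset({"namespace", "app", "env", "cluster"})


def apply_label_allowlist(
    labels: dict[str, str],
    allowed: frozenset = ALLOWED_LABELS,
) -> tuple[dict[str, str], list[str]]:
    """Keep only approved labels and report the names that were dropped."""
    kept = {name: value for name, value in labels.items() if name in allowed}
    dropped = sorted(set(labels) - set(kept))
    return kept, dropped


# Hypothetical labels discovered from Kubernetes metadata.
discovered = {
    "namespace": "production",
    "app": "payment-api",
    "env": "prod",
    "cluster": "us-west-2-eks",
    "pod": "payment-service-7d9f6b-xkqr2",
    "pod_template_hash": "7d9f6b",
}
kept, dropped = apply_label_allowlist(discovered)
```

Logging the dropped names during rollout is a cheap way to see which inherited labels the allowlist is protecting you from.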

For teams already running the OTEL Collector in their pipeline, the Collector can push logs to Loki directly using the otlphttp exporter. This works well when you want a unified collection pipeline rather than running both Promtail and an OTEL Collector sidecar. The tradeoff is that the OTEL path maps labels differently from Promtail: resource attributes become Loki labels by default, which can again create cardinality issues if your OTEL resource attributes include per-instance or per-request values.

Grafana and the query-layer gotchas

Most teams run Loki alongside Grafana, which means the Explore interface becomes the primary debugging surface during incidents. One thing that bites people repeatedly: the default time range in Explore is "last 1 hour." In a high-ingest environment, this is a lot of data to scan, and if your label selector isn't tight enough, queries will time out or return partial results.

During incidents, discipline around time range selection matters. Start with a narrow window around the alert timestamp — 15 minutes before and after is usually sufficient to establish causality. Widen only if you don't find the signal there. This isn't just about Loki query performance; it's about cognitive load. Scanning an hour of logs when the incident started six minutes ago is noise generation, not investigation.

When we built Devloom's log correlation feature, we designed the log query window to default to the deployment window — specifically, the five minutes before and ten minutes after the last deploy that touched the affected service. That constraint alone cut our false-positive signal rate significantly. Loki's speed is acceptable; it's the breadth of the initial query that kills triage time, not the tool itself.
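
Both windows described in this section are simple offsets around an anchor timestamp. The sketch below computes them and converts the bounds to nanosecond Unix timestamps, a format Loki's query_range HTTP endpoint accepts for start and end. The helper names are hypothetical and the timestamps are made up for the example.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_around(
    anchor: datetime, before: timedelta, after: timedelta
) -> tuple[datetime, datetime]:
    """Return (start, end) for a window placed around an anchor time."""
    return anchor - before, anchor + after


def to_unix_nanos(ts: datetime) -> int:
    """Convert a timezone-aware datetime to integer nanoseconds since the epoch."""
    delta = ts - EPOCH
    return (delta.days * 86_400 + delta.seconds) * 1_000_000_000 + delta.microseconds * 1_000


def query_range_params(query: str, start: datetime, end: datetime) -> dict[str, str]:
    """Build the query, start, and end parameters for a range query."""
    return {"query": query, "start": str(to_unix_nanos(start)), "end": str(to_unix_nanos(end))}


alert_time = datetime(2025, 10, 7, 12, 0, tzinfo=timezone.utc)    # hypothetical alert
deploy_time = datetime(2025, 10, 7, 11, 50, tzinfo=timezone.utc)  # hypothetical deploy

# Incident triage: 15 minutes either side of the alert.
incident_start, incident_end = window_around(
    alert_time, timedelta(minutes=15), timedelta(minutes=15)
)
# Deployment window: five minutes before, ten minutes after the deploy.
deploy_start, deploy_end = window_around(
    deploy_time, timedelta(minutes=5), timedelta(minutes=10)
)
params = query_range_params('{app="payment-api", env="prod"}', deploy_start, deploy_end)
```

Passing explicit bounds like these, instead of relying on a default range, keeps the initial query narrow by construction.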

What Loki won't do well

Full-text search across historical log data is slow in Loki. If your workflow involves "find all logs that mention this user's email address across any service over the last 90 days," Loki is the wrong tool. That pattern requires a full-text index, and you should use Elasticsearch or OpenSearch for it. Loki is optimized for "give me logs from these specific services in this time window, filtered by these conditions" — which is exactly the access pattern you need during incident triage, and not the pattern you need for compliance auditing or user-behavior analysis.

Long-term log retention is also a cost management problem with Loki in high-ingest environments. Loki stores compressed chunks in object storage (S3, GCS, or similar), which is cheap, but the querier still has to retrieve and decompress them for historical queries. For logs older than 30 days, most teams set up a tiered retention policy and accept that queries over old data will be slower.

Loki's label model, understood correctly, is a strength for observability workloads. The teams that get frustrated with it are usually the ones trying to use it for access patterns it was never designed for. For incident-time log correlation across microservices, it's a solid choice — as long as you build the label model before you have a cardinality problem rather than after.