Shipping Devloom 0.9: what we built and what we got wrong

Marcus Webb

We shipped Devloom 0.9 three weeks ago. The release went smoothly by our standards — one post-deploy incident with a 40-minute resolution time, which, given what this release contained, I'll take. This post is the engineering retrospective we usually keep internal, shared externally because we've found that public post-mortems on product development decisions are more useful to other builders than polished feature announcements.

The headline features in 0.9: blast radius mapping that surfaces dependency impact before you start remediation, remediation diffs that show a before/after preview of any suggested fix, and a rebuilt confidence scoring system that replaced the heuristic-heavy v1 with a statistical model trained on our incident history. Here's what worked, what didn't, and what we're changing for 1.0.

Blast radius mapping: what worked

Blast radius mapping is the feature we've been most excited about since we started building it in Q4 last year. The problem it solves: when you get paged for service A, your first question after "what's wrong?" is "what else is breaking because of this?" Without tooling, answering that means navigating the dependency graph by hand: checking your service mesh telemetry, checking which services call the affected one, and checking which services the affected one calls. It's five to ten minutes of mechanical work that happens at the worst possible moment.

The 0.9 implementation does this automatically: when an incident fires, Devloom traces the service dependency graph from the affected service outward, scores each downstream service by its current error/latency state, and renders a visual impact map within 30 seconds of alert receipt. Red nodes are actively degraded. Amber nodes show elevated error rates that may not yet have fired their own alerts. Gray nodes are clean.
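A minimal sketch of that traversal, assuming the graph is a map from each service to the services that depend on it and that a single error rate stands in for the "error/latency state". The function names and both thresholds are hypothetical, chosen for illustration; they are not Devloom's actual scoring rules.

```python
from collections import deque

# Hypothetical thresholds; the real scoring inputs are not described in this post.
DEGRADED_ERROR_RATE = 0.05
ELEVATED_ERROR_RATE = 0.01


def classify(error_rate):
    """Map a service's current error rate to a map color."""
    if error_rate >= DEGRADED_ERROR_RATE:
        return "red"
    if error_rate >= ELEVATED_ERROR_RATE:
        return "amber"
    return "gray"


def blast_radius(dependents, error_rates, affected):
    """Breadth-first walk outward from the affected service.

    dependents: maps a service to the services that depend on it.
    error_rates: maps a service to its current error rate (0.0 to 1.0).
    Returns {service: color} for every service reachable from `affected`.
    """
    colors = {}
    queue = deque([affected])
    seen = {affected}
    while queue:
        service = queue.popleft()
        colors[service] = classify(error_rates.get(service, 0.0))
        for nxt in dependents.get(service, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return colors


# Made-up services and error rates, for illustration only.
dependents = {
    "auth": ["checkout", "profile"],
    "checkout": ["orders"],
}
error_rates = {"auth": 0.20, "checkout": 0.02, "profile": 0.0, "orders": 0.001}
impact = blast_radius(dependents, error_rates, "auth")
```

In this toy graph, `auth` comes back red, `checkout` amber, and the other two gray, which is the "spreading but not yet paged" case the amber color exists for.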

What worked well in early usage: the visual immediately answers "is this contained?" for the most common incident class — a single-service regression with clean boundaries. Engineers told us this alone was saving them five to eight minutes per incident in the orientation phase. For contained incidents, the map confirmed containment quickly. For spreading incidents, it surfaced the spread before downstream teams were paged.

What surprised us: the dependency graph quality turned out to matter more than we expected. Teams that had well-maintained service catalogs (a service registry with up-to-date dependency declarations) got much better blast radius maps than teams relying purely on observed traffic patterns. Dependencies inferred from traffic data are nearly complete, but low-traffic paths, async dependencies, and emergency fallback routes are often invisible to traffic-based inference. For 1.0, we're building a service catalog import that lets teams explicitly declare dependencies and mark them as critical vs. optional.
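One way to picture how declared and inferred dependencies could be combined is sketched below. This is an assumption about the shape of the data, not the planned 1.0 design: the function name, the edge representation, and the "unknown" criticality label are all hypothetical.

```python
def merge_dependencies(inferred, declared):
    """Combine traffic-inferred edges with explicitly declared ones.

    inferred: set of (caller, callee) pairs seen in traffic.
    declared: dict mapping (caller, callee) to "critical" or "optional".
    Declared edges win; inferred-only edges get criticality "unknown".
    """
    merged = {}
    for edge in inferred:
        merged[edge] = {"source": "inferred", "criticality": "unknown"}
    for edge, criticality in declared.items():
        source = "both" if edge in inferred else "declared"
        merged[edge] = {"source": source, "criticality": criticality}
    return merged


# Made-up edges, for illustration only.
inferred = {("checkout", "auth"), ("checkout", "pricing")}
declared = {
    ("checkout", "auth"): "critical",
    ("checkout", "fallback-cache"): "optional",  # never seen in normal traffic
}
edges = merge_dependencies(inferred, declared)
```

The fallback route only exists in the merged graph because someone declared it, which is the gap traffic-based inference leaves.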

Remediation diffs: the feedback was clear

Remediation diffs were the feature I was least confident about going into 0.9. The concept: when Devloom suggests a remediation action (rollback to revision X, apply config change Y), instead of presenting the action as a command, present it as a diff — here's exactly what changes, here's what the new state will look like versus the current state.

The feedback from the first three weeks has been unambiguous: engineers trust the suggestions more when they can see the change. Not because the diff contains information they couldn't get by reading the command, but because the visual format of "line removed / line added" activates a different review mode than "command to execute." It slows engineers down slightly at the decision point, which turns out to be good — more questions asked before applying, fewer "this didn't work, now what" moments after applying.

The implementation was harder than expected. Generating a diff for a config revert is straightforward — we have both versions in the event stream. Generating a diff for a Kubernetes rollback is more involved because the deployment spec changes across multiple files, and some of those files (ConfigMaps, Secrets) may or may not be included in the rollback depending on how the change was deployed. We made a conservative call: show the deployment spec diff clearly, flag any adjacent ConfigMap changes with a warning that they won't be reverted. This errs toward explicit incompleteness rather than false completeness.
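A minimal sketch of the "explicit incompleteness" idea, using Python's standard `difflib`. The function name, the resource names, and the warning wording are hypothetical; the specs here are simplified strings rather than full Kubernetes manifests.

```python
import difflib


def remediation_diff(current, target, name, unreverted=()):
    """Render a unified diff from the current spec to the rollback target.

    unreverted: names of adjacent resources (e.g. ConfigMaps) that changed
    alongside the deployment but are not part of the rollback.
    """
    lines = list(difflib.unified_diff(
        current.splitlines(),
        target.splitlines(),
        fromfile=f"{name} (current)",
        tofile=f"{name} (after rollback)",
        lineterm="",
    ))
    for resource in unreverted:
        lines.append(f"WARNING: {resource} changed with this deploy and will NOT be reverted")
    return "\n".join(lines)


# Made-up spec fragments, for illustration only.
current_spec = "image: api:1.4.2\nreplicas: 3\n"
target_spec = "image: api:1.4.1\nreplicas: 3\n"
preview = remediation_diff(current_spec, target_spec, "deployment/api",
                           unreverted=["configmap/api-settings"])
print(preview)
```

The warning line is appended after the diff rather than folded into it, so the preview never implies that the ConfigMap is part of what gets rolled back.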

Confidence scoring v2: what we got wrong in v1

The old confidence scoring system was embarrassing in retrospect. It was a weighted sum of binary signals: if the alert correlated with a recent deploy, +30 confidence points. If the error rate was above 5%, +20 points. If the affected service had prior incidents with similar symptoms, +15 points. We picked the weights by intuition and never rigorously validated them.

The result: scores that looked like probability estimates but weren't. A score of 78 didn't mean 78% likely to be the root cause; it meant we had chosen weights that summed to 78 for this pattern. Users rightly told us the scores felt arbitrary, because they were.
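To make the problem concrete, here is a sketch of that style of scoring using only the three weights named above. The signal names are hypothetical labels, and the full v1 weight list is not reproduced here.

```python
# The three weights named in this post; the full v1 list is not given here.
V1_WEIGHTS = {
    "recent_deploy": 30,
    "error_rate_above_5pct": 20,
    "similar_prior_incident": 15,
}


def v1_score(signals):
    """Sum the weights of every signal that is present."""
    return sum(w for name, w in V1_WEIGHTS.items() if signals.get(name))


score = v1_score({"recent_deploy": True, "error_rate_above_5pct": True})
# score is 50, but nothing ties "50" to a 50% chance of being right.
```

The number is just arithmetic on hand-picked constants, so changing any weight changes the score without changing how often the suggestion is actually correct.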

v2 replaced this with a calibrated classifier trained on our incident history, specifically on incidents where the eventual root cause was confirmed. We used the same features but learned the weights from data rather than setting them by hand. The output is calibrated to actual precision: predictions made with a confidence score of 80 are correct about 80% of the time on held-out incidents. We checked this calibration on a held-out set before shipping.
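A calibration check of this kind can be sketched as a simple bucketing exercise: group held-out predictions by confidence and compare each bucket's stated confidence to its observed hit rate. The function name, bucket width, and sample data below are assumptions for illustration, not our validation code or results.

```python
from collections import defaultdict


def calibration_table(predictions, bucket_width=10):
    """Group held-out predictions by confidence bucket and report hit rate.

    predictions: iterable of (confidence 0-100, was_correct bool).
    Returns {bucket_start: (count, observed_precision)}.
    """
    buckets = defaultdict(lambda: [0, 0])  # bucket -> [correct, total]
    for confidence, was_correct in predictions:
        # Clamp so a confidence of exactly 100 falls in the top bucket.
        start = min(int(confidence) // bucket_width * bucket_width, 100 - bucket_width)
        buckets[start][0] += int(was_correct)
        buckets[start][1] += 1
    return {
        start: (total, correct / total)
        for start, (correct, total) in sorted(buckets.items())
    }


# Made-up held-out results, for illustration only.
held_out = [(82, True), (85, True), (80, True), (88, True), (84, False),
            (31, False), (35, True), (38, False)]
table = calibration_table(held_out)
```

A well-calibrated model shows observed precision close to the bucket's confidence range in every row; a large gap in any bucket is the signal to retrain or recalibrate.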

The practical difference: users now treat high-confidence suggestions as strong signal rather than as a number to ignore. Low-confidence suggestions come with a visible note that Devloom is less certain and the engineer should do more investigation before acting. This is more honest and more useful than false precision.

What we got wrong in v2: we left time of day out of the model. The same symptom profile at 9am (the post-morning deploy window) has a different distribution of likely causes than the same profile at 3am (when no deploys happen but cron jobs run). v2 has no time features, and we're already seeing cases where the model is confidently wrong on overnight incidents that are cron-related rather than deploy-related. Time features go into the v2.1 training run.
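One common way to encode time of day for a model is a cyclic sine/cosine pair, sketched below. This is a general technique offered as an illustration; it is not a statement of which time features will go into v2.1, and the feature names and the example date are arbitrary.

```python
import math
from datetime import datetime


def time_features(fired_at):
    """Encode hour of day on a circle so 23:00 and 01:00 land close together."""
    hour = fired_at.hour + fired_at.minute / 60
    angle = 2 * math.pi * hour / 24
    return {"hour_sin": math.sin(angle), "hour_cos": math.cos(angle)}


# Arbitrary example date; only the time of day matters here.
morning = time_features(datetime(2024, 1, 15, 9, 0))
overnight = time_features(datetime(2024, 1, 15, 3, 0))
```

A raw hour value would treat 23 and 0 as far apart; the circular encoding avoids that, and the two features together distinguish 9am from 3am even though their sine components happen to match.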

The post-deploy incident: what happened

The 0.9 release itself had an incident — approximately 40 minutes of degraded blast radius map rendering for teams on our multi-region deployment. The root cause was a dependency version mismatch in our graph rendering service that was introduced in the 0.9 branch and not caught in staging because our staging environment doesn't mirror the multi-region traffic distribution that triggered it.

We caught it via our own product, which at least confirmed that the alert-to-context workflow works. The blast radius map for our own infrastructure showed the rendering service degraded with clean boundaries — no other services affected. The remediation suggestion was accurate: roll back the graph renderer to 0.8.4 while we investigated. We applied it and the incident resolved.

The staging gap is a structural problem we haven't fixed. Full multi-region traffic simulation in staging is expensive, and the failure mode we hit was a low-probability interaction. Our plan for 1.0 is a staged rollout with automatic halt on error rate deviation rather than a full staging fix — more monitoring gates, not a more faithful staging environment.
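The shape of such a gate can be sketched as a loop over traffic stages that stops as soon as the error rate drifts too far from baseline. Everything here is hypothetical: the stage percentages, the deviation threshold, and the `observe_error_rate` callable, which stands in for real monitoring.

```python
def run_staged_rollout(stages, observe_error_rate, baseline, max_deviation=0.01):
    """Advance through rollout stages, halting if error rate drifts from baseline.

    stages: traffic percentages to step through, e.g. [1, 10, 50, 100].
    observe_error_rate: callable taking a stage percentage, returning the
        error rate observed at that stage (a stand-in for real monitoring).
    Returns (completed_stages, halted_at) where halted_at is None on success.
    """
    completed = []
    for percent in stages:
        error_rate = observe_error_rate(percent)
        if error_rate - baseline > max_deviation:
            return completed, percent
        completed.append(percent)
    return completed, None


# Simulated monitoring: the regression only appears once traffic reaches 50%.
simulated = {1: 0.002, 10: 0.003, 50: 0.040, 100: 0.040}
completed, halted_at = run_staged_rollout([1, 10, 50, 100], simulated.get, baseline=0.002)
```

In the simulation the rollout completes the 1% and 10% stages and halts at 50%, which is the behavior we want for failures that only show up under production traffic distribution.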

What's next for 1.0

The three things going into 1.0 that came directly from 0.9 feedback:

First, service catalog integration for better blast radius fidelity, as described above. We're starting with import from existing PagerDuty service catalogs and Backstage, since those are the two most common sources in the teams we talk to.

Second, post-incident summary generation. Right now Devloom captures the incident timeline but doesn't generate the post-mortem document. We're building a first-draft generator that takes the incident timeline, the confirmed root cause, and the remediation applied and produces a structured post-mortem draft. The engineer fills in the "contributing factors" and "action items" sections. Mechanical template filling is the part that gets skipped under deadline pressure; we'd rather generate the first pass and let engineers edit it than have a blank template that doesn't get filled out.
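A first-draft generator of this kind is mostly template filling. The sketch below assumes a plain-text output and hypothetical section names and function signature; the example inputs paraphrase the 0.9 release incident described earlier, with relative timestamps as placeholders.

```python
def draft_post_mortem(title, timeline, root_cause, remediation):
    """Fill the mechanical sections; leave judgment sections for the engineer."""
    lines = [f"Post-mortem: {title}", "", "Timeline"]
    lines += [f"  {when}  {what}" for when, what in timeline]
    lines += ["", "Root cause", f"  {root_cause}",
              "", "Remediation applied", f"  {remediation}",
              "", "Contributing factors", "  TODO (engineer)",
              "", "Action items", "  TODO (engineer)"]
    return "\n".join(lines)


draft = draft_post_mortem(
    title="Degraded blast radius map rendering (0.9 release)",
    timeline=[("T+0", "alert fired"), ("T+40m", "resolved after rollback")],
    root_cause="Dependency version mismatch in the graph rendering service",
    remediation="Rolled back the graph renderer to 0.8.4",
)
```

The two sections that need judgment are emitted as explicit TODOs so a draft can't be mistaken for a finished post-mortem.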

Third, integration with internal deployment gates — specifically, surfacing Devloom's blast radius map as a check in the deployment pipeline before you merge, not just after the alert fires. If a proposed deploy touches a service with currently-degraded downstream dependencies, that's information worth having before you push.

We're a small team: three engineers writing production code and one person splitting time between engineering and everything else. 0.9 was the largest release we've done by surface area. Most of what went right went right because we were conservative about scope and honest about what we didn't know. That approach doesn't change for 1.0.