Retry Logic Best Practices: Exponential Backoff, Jitter, and Circuit Breakers

Retry logic is one of the most commonly implemented and most commonly incorrect pieces of infrastructure in backend systems. The naive version — try three times with a 1-second sleep between attempts — works in development and causes incidents in production. Understanding why requires understanding what makes a retry strategy correct versus dangerous.

What Naive Retry Gets Wrong

The most dangerous property of naive retry is synchronized amplification. When a downstream service is struggling — high latency, intermittent 503s, database contention — every client that was waiting retries at the same interval. This creates coordinated load spikes exactly when the downstream service is least able to handle them. Instead of the service recovering, it receives wave after wave of retried requests from all clients simultaneously, each wave larger than the last. The service that could have recovered in 30 seconds takes 5 minutes to recover because the retries keep it saturated.

The second problem is retrying non-retryable errors. A 400 Bad Request means your request is malformed, so retrying it unchanged will produce another 400. A 401 Unauthorized means your auth token is missing, expired, or invalid, so retrying with the same token produces another 401. The 4xx/5xx split is a useful first cut but not a perfect one: 408 Request Timeout and 429 Too Many Requests are 4xx responses that are usually worth retrying. Naive retry code that treats every failure alike wastes attempts on errors that will never succeed without a code or credential change.
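One way to encode that distinction is an explicit allow-list of status codes. The sketch below is a minimal illustration; the function name and the exact set of codes are assumptions based on common convention, not taken from any particular SDK.

```javascript
// Assumed allow-list: statuses that usually indicate a transient condition.
// 408 and 429 are 4xx codes that are normally safe to retry; the 5xx entries
// cover generic server errors, bad gateways, overload, and gateway timeouts.
const RETRYABLE_STATUSES = new Set([408, 429, 500, 502, 503, 504]);

// Hypothetical helper: returns true when retrying the same request could succeed.
function isRetryableStatus(status) {
  return RETRYABLE_STATUSES.has(status);
}

console.log(isRetryableStatus(400)); // malformed request: do not retry
console.log(isRetryableStatus(503)); // service unavailable: retry with backoff
```

An allow-list fails safe: a status code nobody thought about is not retried by default.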

Exponential Backoff

Exponential backoff multiplies the wait time by a constant factor (usually 2) after each failed attempt: with a 1-second base, the first retry waits 1 second, the second 2 seconds, the third 4 seconds, the fourth 8 seconds. The base delay and the multiplier are configurable. In the snippet below, attempt is zero-based, so the first retry uses attempt = 0.

const waitMs = Math.min(
  maxWaitMs,        // cap at a reasonable maximum
  baseMs * (2 ** attempt)
);

Exponential backoff gives the downstream service progressively more recovery time. It also sheds retry load as an outage drags on: because the gap doubles each time, a client sends far fewer retries over a given window than it would at a fixed interval.

The maxWaitMs cap is important. Without it, a client that hits an error on attempt 10 with a 1000ms base would wait about 17 minutes (1000 ms × 2^10 = 1,024,000 ms) before the next retry. That's not acceptable for an integration that needs to complete within a reasonable time window. A cap of 30–60 seconds is typical.
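To make the effect of the cap concrete, here is a small sketch that prints the wait schedule for a 1000ms base and a 30-second cap. The function name backoffMs is illustrative, and attempt is assumed to be zero-based.

```javascript
// Capped exponential backoff. `attempt` is zero-based: 0 is the wait
// after the first failure.
function backoffMs(attempt, baseMs, maxWaitMs) {
  return Math.min(maxWaitMs, baseMs * 2 ** attempt);
}

const schedule = [];
for (let attempt = 0; attempt < 8; attempt++) {
  schedule.push(backoffMs(attempt, 1000, 30_000));
}

// Waits in ms: 1000, 2000, 4000, 8000, 16000, then 30000 from attempt 5 onward.
console.log(schedule.join(", "));

// With an effectively unlimited cap, attempt 10 waits 1,024,000 ms (about 17 minutes).
console.log(backoffMs(10, 1000, Number.MAX_SAFE_INTEGER));
```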

Adding Jitter

Jitter randomizes the wait time to desynchronize concurrent retrying clients. Without jitter, all clients that receive a 503 at the same time will retry at the same intervals — the thundering herd problem described above. With jitter, each client picks a random wait within a range, spreading the retry load over time.

There are several jitter strategies. The most effective for this use case is full jitter:

const waitMs = Math.random() * Math.min(maxWaitMs, baseMs * (2 ** attempt));

Full jitter picks a random value between 0 and the exponential wait ceiling. It provides maximum desynchronization. The tradeoff is that a client might draw a very short wait even on a late attempt, but since the attempts are already failing, a short wait followed by another failure costs little.
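Putting the pieces together, a retry loop with full jitter might look like the following sketch. The wrapper name retryWithFullJitter and its option names are hypothetical; sleep and random are injectable only so the behavior can be tested deterministically.

```javascript
// Full jitter: a uniform random wait between 0 and the capped exponential ceiling.
function fullJitterMs(attempt, baseMs, maxWaitMs, random = Math.random) {
  return random() * Math.min(maxWaitMs, baseMs * 2 ** attempt);
}

// Hypothetical retry wrapper. `operation` is any async function; `shouldRetry`
// decides whether a given error is worth another attempt.
async function retryWithFullJitter(operation, options = {}) {
  const {
    maxAttempts = 5,
    baseMs = 1000,
    maxWaitMs = 30_000,
    shouldRetry = () => true,
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
    random = Math.random,
  } = options;

  for (let attempt = 0; ; attempt++) {
    try {
      return await operation(attempt);
    } catch (err) {
      const isLastAttempt = attempt + 1 >= maxAttempts;
      // Give up on the final attempt or on an error that retrying cannot fix.
      if (isLastAttempt || !shouldRetry(err)) throw err;
      await sleep(fullJitterMs(attempt, baseMs, maxWaitMs, random));
    }
  }
}
```

A caller would pass something like shouldRetry: (err) => err.status >= 500, assuming its errors carry a status property. Errors that fail that check are rethrown immediately without waiting.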

Decorrelated jitter is an alternative that derives each wait from the previous wait rather than from the attempt number. AWS's Architecture Blog analysis of backoff and jitter covers it alongside full jitter and found the two close in performance. The choice between full and decorrelated jitter matters at large scale (thousands of concurrent retrying clients); for most enterprise API integrations, full jitter is sufficient.
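For comparison, decorrelated jitter is commonly written as sleep = min(cap, random(base, previousSleep * 3)): each wait is drawn between the base and three times the previous wait. A minimal sketch under that formulation follows; the function and parameter names are illustrative.

```javascript
// Decorrelated jitter: the next wait depends on the previous wait,
// not on the attempt counter.
function decorrelatedJitterMs(previousWaitMs, baseMs, maxWaitMs, random = Math.random) {
  const upperMs = Math.max(baseMs, previousWaitMs * 3);
  const candidateMs = baseMs + random() * (upperMs - baseMs);
  return Math.min(maxWaitMs, candidateMs);
}

// Usage: seed the sequence with the base, then feed each wait back in.
let waitMs = 1000;
for (let retry = 0; retry < 5; retry++) {
  waitMs = decorrelatedJitterMs(waitMs, 1000, 30_000);
  console.log(Math.round(waitMs));
}
```

Unlike full jitter, the wait never drops below the base, at the cost of carrying one extra piece of state (the previous wait) per retry sequence.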

Circuit Breakers

A circuit breaker is a state machine that tracks downstream service health and stops sending requests when the service is clearly unavailable. The three states: closed (normal operation, all requests pass through), open (service is failing, all requests fail immediately without being sent), half-open (service may have recovered, allow a probe request through to test).

Circuit breakers serve a different purpose than retry logic. Retry logic handles transient individual failures. A circuit breaker handles sustained downstream unavailability. Without a circuit breaker, a client with retry logic will spend 30 minutes retrying requests to a service that's been down since midnight — burning queue space, holding connections, and producing log noise. A circuit breaker trips after a configurable failure threshold and stops sending new requests until the service shows signs of recovery.
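The three states can be sketched as a small class. This is a simplified illustration, not a production implementation: the class name, the consecutive-failure threshold, and the fixed reset timeout are all assumptions, and real breakers often use rolling failure rates instead.

```javascript
class CircuitBreaker {
  constructor({ failureThreshold = 5, resetTimeoutMs = 30_000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold; // consecutive failures before tripping
    this.resetTimeoutMs = resetTimeoutMs;     // how long to stay open before probing
    this.now = now;                           // injectable clock, for testing
    this.state = "closed";
    this.failures = 0;
    this.openedAt = 0;
  }

  // Returns true if a request may be sent right now.
  allowRequest() {
    if (this.state === "open" && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = "half-open"; // let exactly one probe request through
      return true;
    }
    return this.state === "closed";
  }

  recordSuccess() {
    this.state = "closed";
    this.failures = 0;
  }

  recordFailure() {
    this.failures += 1;
    // A failed probe reopens immediately; otherwise trip at the threshold.
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }
}
```

The caller checks allowRequest() before sending, then reports the outcome with recordSuccess() or recordFailure(). In this sketch a probe whose outcome is never reported leaves the breaker half-open, so callers must always report.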

For most enterprise API integrations, a circuit breaker isn't necessary unless you're operating at high throughput or have strict SLA requirements. For lower-volume integrations, exponential backoff with a reasonable maximum attempt count and max wait time achieves similar protection without the circuit breaker state management complexity.

Retry at the SDK Layer

The Devloom SDK applies retry logic at the connector layer with per-connector configuration. The default strategy is exponential backoff with full jitter, a 5-second base, a 60-second cap, and 5 maximum attempts. 429 errors are handled separately from 5xx errors: a 429 reads the Retry-After header and waits accordingly, while a 5xx applies the backoff strategy.
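How the SDK parses Retry-After internally is not shown here. As a generic illustration, the header can carry either a number of seconds or an HTTP date, and a parser needs to handle both. The function name retryAfterMs is hypothetical.

```javascript
// Converts a Retry-After header value into a wait in milliseconds.
// Returns null when the header is missing or unparseable, so the caller
// can fall back to its normal backoff strategy.
function retryAfterMs(headerValue, nowMs = Date.now()) {
  if (headerValue == null) return null;
  const value = String(headerValue).trim();

  // Form 1: delay in whole seconds, e.g. "120".
  if (/^\d+$/.test(value)) {
    return Number(value) * 1000;
  }

  // Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
  const dateMs = Date.parse(value);
  if (Number.isNaN(dateMs)) return null;
  return Math.max(0, dateMs - nowMs); // a date in the past means "retry now"
}

console.log(retryAfterMs("120")); // 120000
```

It is usually worth capping the returned value as well, so that a misbehaving server cannot park a client for hours.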

You can override the retry configuration per client:

const client = dlx.client({
  connector: 'orbis-erp',
  auth: dlx.auth.oauth2(),
  retry: {
    maxAttempts: 8,
    baseMs: 2000,
    maxWaitMs: 120_000,
    strategy: 'exponential',
    jitter: 'full'
  }
});

Or disable retry entirely for use cases where you want to handle it yourself:

const client = dlx.client({
  connector: 'fieldvault-crm',
  auth: dlx.auth.oauth2(),
  retry: { maxAttempts: 0 }
});