Skip to content
fullstackhero

Concept

HTTP resilience

Polly v8 via Microsoft.Extensions.Http.Resilience — retry, circuit breaker, timeout, hedging — applied to every named HttpClient in the kit.

views 0 Last updated

Microsoft.Extensions.Http.Resilience (Polly v8 under the hood) is wired into every named HttpClient the kit ships. The default pipeline adds retry with exponential backoff on transient failures (5xx, 408, 429), a circuit breaker to fail fast when a downstream is consistently broken, an attempt timeout to prevent hung requests from holding threadpool slots, and an optional hedging strategy for read-only calls where latency variance matters more than load.

What the standard pipeline does

AddHeroPlatform calls AddStandardResilienceHandler on every named client. That adds the following pipeline:

[Outer] Total request timeout (e.g. 30s)
[Retry] Up to 3 attempts on transient errors with exponential backoff + jitter
[Breaker] Circuit breaker — opens after N consecutive failures, half-opens after cooldown
[Attempt] Per-attempt timeout (e.g. 10s)
HttpClient.SendAsync

Each strategy is configurable individually; the defaults are tuned for the typical “internal API to internal API” call shape.

Named clients in the kit

The kit ships these named clients out of the box:

NameWhereUsed by
WebhooksWebhooksModule.ConfigureServicesOutbound webhook deliveries
DefaultImplicit via IHttpClientFactoryAnywhere IHttpClientFactory.CreateClient() is called without a name

The Webhooks module is the most visible consumer — WebhookDispatchJob POSTs payloads via the Webhooks client. The pipeline’s retry strategy is what gives webhook delivery its “retry transient failures, give up on permanent ones” behaviour.

How resilience integrates with [AutomaticRetry]

Hangfire’s [AutomaticRetry(Attempts = 4, DelaysInSeconds = ...)] and the HTTP resilience pipeline are two separate retry layers. The kit uses both, deliberately:

  • HTTP-layer retry handles transient network issues (5xx, timeouts) within a single attempt — fast, in-process, no job queue churn.
  • Job-layer retry handles longer-term outages — minutes-to-hours-scale backoff while the Hangfire job stays in the queue.

For the Webhooks module:

  • 1 HTTP-layer retry runs the configured Polly pipeline (small backoff in seconds).
  • If that exhausts, the Hangfire job fails → [AutomaticRetry] kicks in with 30 s / 2 m / 10 m / 1 h delays.

Together, a single webhook delivery has up to 5 attempts spread over ~1 h 12 min.

Tuning the pipeline

The defaults are sensible. When you need to deviate, configure per-named-client:

services.AddHttpClient("SlowExternalApi", client =>
{
client.BaseAddress = new Uri("https://slow.example.com/");
client.Timeout = TimeSpan.FromMinutes(2);
})
.AddStandardResilienceHandler(opts =>
{
opts.AttemptTimeout.Timeout = TimeSpan.FromSeconds(45);
opts.TotalRequestTimeout.Timeout = TimeSpan.FromMinutes(3);
opts.Retry.MaxRetryAttempts = 5;
opts.Retry.BackoffType = DelayBackoffType.Exponential;
opts.Retry.UseJitter = true;
});

Don’t override globally — the standard defaults are right for the typical case. Customise per-client where the upstream demands it.

When NOT to retry

The resilience pipeline retries on transient failures only:

  • HTTP 408 Request Timeout
  • HTTP 429 Too Many Requests (with Retry-After respected)
  • HTTP 500, 502, 503, 504
  • Connection / DNS / socket exceptions

It does not retry on:

  • HTTP 400, 401, 403, 404 — permanent client errors; retrying won’t change anything.
  • HTTP 405, 406, 409 — semantic conflicts; retrying won’t help.

For specific 4xx codes that should be treated as transient (rare — e.g. some APIs return 423 Locked on transient contention), configure opts.Retry.ShouldHandle with a custom predicate.

Circuit breaker

The breaker opens when too many consecutive failures land in a short window. While open, requests fail fast with BrokenCircuitException instead of waiting for the timeout. After a cooldown, the breaker half-opens — one trial request decides whether to close (resume normal traffic) or stay open (downstream still broken).

This is what stops a flaky downstream from cascading into your threadpool. Watch breaker open / close events in your traces; opens that don’t half-close mean the downstream is genuinely down and needs human attention.

Hedging (for read-only calls)

Hedging fires N parallel requests; the first response wins, the rest are cancelled. Trades load for latency. Don’t enable globally — it doubles the downstream’s load — but for specific high-percentile-sensitive reads it can be worth it:

.AddStandardHedgingHandler(opts =>
{
opts.Hedging.MaxHedgedAttempts = 2; // up to 2 parallel requests
opts.Hedging.Delay = TimeSpan.FromMilliseconds(200); // wait 200ms before firing the hedge
});

Use it for paid third-party APIs only when their tail latency genuinely matters and they can absorb 2× the call volume.

Gotchas

  • HttpClient.Timeout is overridden by the pipeline’s TotalRequestTimeout. Configure timeouts on the pipeline, not on the client.
  • Retry + non-idempotent requests = duplicate creates. If you POST /api/v1/orders and retry on 502, you might create two orders. Combine with Idempotency-Key on every retried POST.
  • Polly’s exceptions wrap downstream exceptions. When catching, look at InnerException to get the actual transport failure.
  • OpenTelemetry sees each attempt as its own span. A request that succeeds on the third try shows three spans in the trace, with two error=true parents. This is correct; ignore the noise in dashboards.