Production integrations send thousands of requests over networks that can drop connections, throttle traffic, or return transient errors. A reusable retry utility lets you apply one consistent backoff policy across every API interaction instead of duplicating retry logic per call.
If you are using gRPC with client credentials, set up the token refresh module before running the examples on this page.
Before building your own utility, understand what the SDK already handles so you do not retry the same error twice.
The Lattice REST SDKs retry automatically on HTTP 5xx, 408, 409, and 429 responses. The SDK retry mechanism applies
exponential backoff that starts at 1 second and is capped at 60 seconds, with ±20% jitter, and it honors the Retry-After, retry-after-ms,
and X-RateLimit-Reset headers. The default is 2 retries per call, which you can override per call with request_options={"max_retries": N}.
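To make the backoff numbers above concrete, here is a minimal sketch of how such a delay could be computed. It is not the SDK's implementation. The function name backoff_delay, the doubling factor, the order in which the cap and jitter are applied, and the treatment of the server hint as a plain number of seconds are all assumptions made for illustration.

```python
import random


def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0,
                  jitter=0.2, rng=random.random):
    """Return the seconds to wait before retry number `attempt` (0-based).

    Hypothetical helper: models a 1 s start, a 60 s cap, and +/-20% jitter.
    """
    if retry_after is not None:
        # Assumption: a server-provided hint (in seconds) wins over the
        # computed backoff.
        return max(0.0, float(retry_after))
    # Assumption: the delay doubles on each attempt, up to the cap.
    delay = min(cap, base * (2 ** attempt))
    # Scale by a random factor in [1 - jitter, 1 + jitter].
    factor = 1.0 + jitter * (2.0 * rng() - 1.0)
    return delay * factor
```

With the jitter held at its midpoint (rng=lambda: 0.5), attempts 0, 3, and 10 yield 1.0, 8.0, and 60.0 seconds respectively.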
The gRPC SDKs do not retry automatically, so gRPC-based Lattice integrations must implement their own retries.
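If you build a retry utility on top of the Lattice REST SDKs, set request_options={"max_retries": 0} on each SDK
call to delegate the entire retry policy to your utility. Without this, both layers retry independently, and
a single transient error can trigger up to (sdk_retries + 1) × utility_retries total requests, where utility_retries counts the attempts your utility makes. For example, with the SDK default of 2 retries and a utility that makes 5 attempts, one failing call can produce up to 15 requests.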
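The multiplication above can be checked with a small simulation. This is only a sketch: count_requests and its parameters are hypothetical names, the SDK layer is modeled as a plain loop, and every attempt is assumed to fail.

```python
def count_requests(sdk_retries, utility_attempts):
    """Count requests sent when every attempt fails and both layers retry."""
    requests_sent = 0

    def sdk_call():
        # Inner layer: one initial try plus `sdk_retries` automatic retries.
        nonlocal requests_sent
        for _ in range(sdk_retries + 1):
            requests_sent += 1
        raise RuntimeError("transient error")

    # Outer layer: the utility invokes the SDK call `utility_attempts` times.
    for _ in range(utility_attempts):
        try:
            sdk_call()
        except RuntimeError:
            continue
    return requests_sent


print(count_requests(sdk_retries=2, utility_attempts=5))  # prints 15
print(count_requests(sdk_retries=0, utility_attempts=5))  # prints 5
```

Setting the inner retry count to 0 brings the total back to exactly the number of attempts the utility makes.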
A single helper function that wraps any API call gives you one place to tune backoff parameters, adjust error classification, and add observability across your integration.
The following utility loops up to max_retries times, sleeping for an exponentially increasing backoff between attempts.
It re-raises immediately on errors that retrying will not fix.
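A minimal sketch of a utility with this shape is shown here. It is an illustration under stated assumptions, not the reference implementation: the names call_with_retry, is_retryable, base_delay, and max_delay are hypothetical, the delay doubles per attempt, and the ±20% jitter is borrowed from the SDK behavior described earlier.

```python
import random
import time


def call_with_retry(fn, is_retryable, max_retries=5, base_delay=1.0,
                    max_delay=60.0, sleep=time.sleep):
    """Call fn() up to max_retries times, backing off between failures.

    is_retryable(exc) decides whether an exception is worth another attempt.
    """
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as exc:
            last_attempt = attempt == max_retries - 1
            # Re-raise immediately on terminal errors or when attempts run out.
            if last_attempt or not is_retryable(exc):
                raise
            # Assumption: double the delay each attempt, capped at max_delay,
            # then apply +/-20% jitter.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay * random.uniform(0.8, 1.2))
```

The sleep parameter is injectable so that tests can record delays instead of waiting. When wrapping a REST SDK call, pass request_options={"max_retries": 0} inside fn so that only this loop retries.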
Not all errors benefit from a retry. Retrying a malformed request or an unauthorized call wastes time and can obscure bugs. Classify errors into two buckets, retryable and terminal, before deciding whether to back off or re-raise immediately.
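As a minimal sketch of this two-bucket split for REST responses, the function below uses the status codes that this page lists for the REST SDKs. The names RETRYABLE_STATUS and classify_status are hypothetical, and treating any unlisted status code as terminal is an assumption, since the page does not say how to handle such codes. Transport-level exceptions are outside the scope of this sketch.

```python
# Retryable codes listed for the REST SDKs; any 5xx is handled by range below.
RETRYABLE_STATUS = frozenset({404, 408, 409, 429})


def classify_status(status_code):
    """Return "retryable" or "terminal" for an HTTP status code."""
    if status_code in RETRYABLE_STATUS or 500 <= status_code <= 599:
        return "retryable"
    # 400, 401, 403, and 413 land here, as does any unlisted code
    # (assumption: unlisted codes are not worth retrying).
    return "terminal"


print(classify_status(429))  # prints retryable
print(classify_status(400))  # prints terminal
```

Keeping the retryable codes in one set means a policy change, such as no longer retrying 404, is a one-line edit.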
The REST SDKs
This utility catches two broad error categories:
- anduril.ApiError with a status_code: Inspect the code to decide.
- httpx.ConnectError and httpx.RemoteProtocolError: Low-level transport failures that are always retryable.
Retryable status codes:
- 404 Not Found: The resource might not yet be visible due to eventual consistency.
- 408 Request Timeout: The server did not receive the complete request in time, often due to transient network latency.
- 409 Conflict: A concurrent write contended with this request, so retrying after a short backoff often succeeds.
- 429 Too Many Requests: The client has been rate-limited, so back off before retrying.
- 5xx Server Error: The server encountered a transient failure that might clear on a subsequent attempt.
Terminal status codes:
- 400 Bad Request: The payload is malformed.
- 401 Unauthorized: Refresh credentials before retrying.
- 403 Forbidden: Retrying will not change permissions.
- 413 Payload Too Large: Reduce the payload size.
The gRPC SDKs
Extract the gRPC status code with status.Code(err) and check it against a map[codes.Code]bool.
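The sentence above describes the Go idiom. The same lookup is sketched here in Python, using the retryable codes this section lists. It assumes the error object exposes a code() method whose result has a name attribute, which mirrors the shape of errors raised by the grpcio package. The sketch is duck-typed so that it runs without grpcio installed, and the names RETRYABLE_GRPC_CODES and is_retryable_grpc_error are hypothetical.

```python
RETRYABLE_GRPC_CODES = frozenset({
    "UNAVAILABLE",
    "DEADLINE_EXCEEDED",
    "RESOURCE_EXHAUSTED",
    "INTERNAL",
    "ABORTED",
    "UNKNOWN",
    "NOT_FOUND",
})


def is_retryable_grpc_error(err):
    """Return True when err carries a status code in the retryable set."""
    code_fn = getattr(err, "code", None)
    if not callable(code_fn):
        # Assumption: errors without a status code are treated as terminal.
        return False
    code = code_fn()
    # Enum-style status codes expose their name; fall back to str() otherwise.
    name = getattr(code, "name", str(code))
    return name in RETRYABLE_GRPC_CODES
```

A set lookup plays the role of the map[codes.Code]bool: membership means retry, and absence means re-raise.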
Retryable codes:
- UNAVAILABLE: The server is temporarily unreachable.
- DEADLINE_EXCEEDED: The call timed out before a response was received.
- RESOURCE_EXHAUSTED: The client has been rate-limited or has exceeded a quota.
- INTERNAL: The server encountered an unspecified failure that may be transient.
- ABORTED: A concurrency conflict occurred, so retrying after a short backoff often succeeds.
- UNKNOWN: An unrecognized error was returned, so treat it as transient.
- NOT_FOUND: The resource might not yet be visible due to eventual consistency.
Terminal codes:
- INVALID_ARGUMENT: Fix the request before retrying.
- PERMISSION_DENIED: Retrying will not grant access.
- UNAUTHENTICATED: Refresh the token before retrying.
- FAILED_PRECONDITION: A required precondition is not met.
- ALREADY_EXISTS: The resource already exists.
A single utility keeps retry behavior consistent across your integration, so tuning the backoff policy or adding logging requires only a change in one place.