Implementing retry logic for failed API document pulls
Construction document ingestion pipelines routinely pull submittals, RFIs, change orders, and takeoff spreadsheets from vendor platforms such as Procore, Autodesk Construction Cloud, and Bluebeam. Network instability, aggressive vendor rate limits, and transient 5xx errors frequently interrupt these pulls, leaving estimators with incomplete cost data and project managers with stale revision histories. Implementing deterministic retry logic transforms intermittent API failures into recoverable events without manual intervention, ensuring that Automated Document Ingestion & Parsing workflows maintain data integrity across distributed worker nodes.
A production-grade retry mechanism must distinguish between permanent failures and transient faults. Blind retries without backoff amplify vendor throttling, corrupt ingestion queues, and trigger cascading timeouts across downstream parsing services. The architecture must target specific HTTP status codes (408, 429, 500, 502, 503, 504), respect vendor-supplied Retry-After headers, enforce strict connect/read timeouts for large binary payloads, and preserve idempotency when pulling versioned change orders or drawing sets.
Core Retry Architecture for Construction APIs
Construction SaaS platforms frequently return 429 Too Many Requests when concurrent document pulls exceed tenant quotas. Transient infrastructure failures on the vendor side manifest as 502 Bad Gateway or 504 Gateway Timeout. Network partitions between cloud workers and vendor CDNs produce connection resets or read timeouts. A robust retry strategy applies exponential backoff with randomized jitter to prevent thundering herd scenarios, caps maximum attempts to avoid indefinite worker blocking, and implements a circuit breaker pattern when vendor error rates exceed a defined threshold.
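The circuit breaker mentioned above can be sketched in a few lines. This is a minimal illustration, not part of tenacity or any vendor SDK: the CircuitBreaker class, its failure_threshold and cooldown parameters, and their default values are all assumptions, and it counts consecutive failures rather than a true error rate to keep the example short.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures, then
    permits a trial request once `cooldown` seconds have elapsed."""

    def __init__(
        self,
        failure_threshold: int = 5,   # assumed default, tune per vendor
        cooldown: float = 60.0,       # assumed default, in seconds
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        """Return True when a document pull may be attempted."""
        if self._opened_at is None:
            return True
        # Half-open state: allow a trial request after the cooldown.
        return self._clock() - self._opened_at >= self.cooldown

    def record_success(self) -> None:
        """Close the breaker and reset the failure count."""
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        """Count a failure and (re)open the breaker at the threshold."""
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

A worker would call allow_request() before each pull and report the outcome afterwards; when the breaker is open, the pull is deferred instead of adding load to a vendor that is already failing.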
The retry predicate must explicitly exclude permanent client errors. Status codes 400, 401, 403, 404, and 422 indicate malformed requests, expired credentials, missing resources, or validation failures that will not resolve through repetition. Retrying these wastes compute cycles and delays pipeline throughput. Instead, the system should immediately route these failures to a dead-letter queue for manual estimator review or credential rotation. For comprehensive guidance on routing and classification, consult the Error Handling Protocols documentation.
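The classification described above can be expressed as a small pure function, which makes the routing decision easy to unit test. The Disposition enum and classify_status name are hypothetical; the status code set is the one listed earlier in this article.

```python
from enum import Enum


class Disposition(Enum):
    """Hypothetical routing outcomes for a vendor response."""
    OK = "ok"
    RETRY = "retry"
    DEAD_LETTER = "dead_letter"


# Transient statuses named in this article.
RETRYABLE_STATUS_CODES = frozenset({408, 429, 500, 502, 503, 504})


def classify_status(status_code: int) -> Disposition:
    """Map an HTTP status code to a routing decision."""
    if status_code in RETRYABLE_STATUS_CODES:
        return Disposition.RETRY
    if status_code >= 400:
        # Remaining 4xx/5xx codes (400, 401, 403, 404, 422, ...) are
        # treated as permanent and sent for manual review.
        return Disposition.DEAD_LETTER
    return Disposition.OK
```

Note that 408 and 429 are 4xx codes that are nonetheless retried, so the check against the retryable set has to come before the generic error check.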
Production Python Implementation
The following implementation uses tenacity for declarative retry control and requests for HTTP transport. It handles construction-specific failure modes, respects Retry-After headers, and reads large PDF/Excel binaries in chunks rather than in a single call. Because it returns the payload as bytes, the whole document is still held in memory, so very large files should be written to disk instead. The code is typed and includes explicit error routing for permanent failures.
    import logging
    import random
    from typing import Callable, Dict, Optional, Tuple

    import requests
    from requests.exceptions import ConnectionError, HTTPError, Timeout
    from tenacity import (
        RetryError,
        before_sleep_log,
        retry,
        retry_if_exception,
        stop_after_attempt,
    )

    logger = logging.getLogger("construction.doc_ingestion")

    RETRYABLE_STATUS_CODES = {408, 429, 500, 502, 503, 504}


    def is_retryable(exception: BaseException) -> bool:
        """Predicate to determine if an exception warrants a retry."""
        if isinstance(exception, (ConnectionError, Timeout)):
            return True
        if isinstance(exception, HTTPError):
            response = getattr(exception, "response", None)
            # Compare against None explicitly: a requests.Response is falsy
            # for 4xx/5xx statuses, so a bare `if response` never matches.
            if response is not None and response.status_code in RETRYABLE_STATUS_CODES:
                return True
        return False


    def wait_with_retry_after(retry_state) -> float:
        """Respect vendor Retry-After header, otherwise apply exponential backoff with jitter."""
        exception = retry_state.outcome.exception()
        if isinstance(exception, HTTPError) and exception.response is not None:
            retry_after = exception.response.headers.get("Retry-After")
            if retry_after:
                try:
                    # Retry-After is either delta-seconds or an HTTP-date. The
                    # date form is not parsed here and uses a fixed 30 seconds.
                    delay = int(retry_after) if retry_after.isdigit() else 30
                    return min(delay, 300)  # Cap at 5 minutes to prevent worker starvation
                except ValueError:
                    pass
        # Fallback: exponential backoff with randomized jitter
        base = min(2 ** retry_state.attempt_number, 60)
        jitter = random.uniform(0, base * 0.5)
        return base + jitter


    @retry(
        retry=retry_if_exception(is_retryable),
        wait=wait_with_retry_after,
        stop=stop_after_attempt(5),
        before_sleep=before_sleep_log(logger, logging.WARNING),
        # reraise is left at its default (False) so that exhausted retries
        # surface as RetryError, which pull_with_dlq_routing catches below.
    )
    def fetch_construction_document(
        url: str,
        headers: Dict[str, str],
        timeout: Tuple[float, float] = (10, 300),  # (connect, read) in seconds
    ) -> bytes:
        """
        Retrieves a construction document from a vendor API with deterministic retry logic.
        Handles rate limits and gateway timeouts, and reads large binaries in chunks.
        """
        # The context manager releases the connection even if a chunk read fails.
        with requests.get(url, headers=headers, timeout=timeout, stream=True) as response:
            response.raise_for_status()
            # Read in chunks. The full payload is still accumulated in memory,
            # so this does not bound memory use for very large files.
            payload = bytearray()
            for chunk in response.iter_content(chunk_size=8192):
                if chunk:
                    payload.extend(chunk)
        return bytes(payload)


    def pull_with_dlq_routing(
        url: str,
        headers: Dict[str, str],
        dlq_callback: Callable[[str, Optional[BaseException]], None],
    ) -> Optional[bytes]:
        """Wrapper that routes exhausted retries and permanent failures to a dead-letter queue."""
        try:
            return fetch_construction_document(url, headers)
        except RetryError as e:
            # Raised by tenacity once all attempts are used up.
            last_exc = e.last_attempt.exception()
            logger.error("Retry exhausted for %s: %s", url, last_exc)
            dlq_callback(url, last_exc)
            return None
        except Exception as e:
            # Non-retryable errors (e.g. 401, 404) propagate unchanged after the first attempt.
            logger.critical("Non-retryable or unexpected failure on %s: %s", url, e)
            dlq_callback(url, e)
            return None

Key Implementation Details
- Deterministic Predicate: The is_retryable function retries only connection errors, timeouts, and the status codes in RETRYABLE_STATUS_CODES. All other 4xx client errors are excluded, which prevents wasted compute on malformed payloads or expired OAuth tokens.
- Header-Aware Backoff: wait_with_retry_after checks the Retry-After header first. If it is absent, the function falls back to exponential backoff with randomized jitter, which is critical when polling multiple vendor CDNs simultaneously.
- Chunked Streaming: Large construction drawings and multi-sheet takeoff workbooks can exceed 500MB. Using stream=True with chunked iteration avoids one monolithic read, but the function still accumulates the full payload in memory. On constrained worker nodes, write the chunks to disk to avoid MemoryError crashes.
- Dead-Letter Routing: The pull_with_dlq_routing wrapper forwards both exhausted retries and non-retryable failures to a callback (e.g., AWS SQS DLQ, Celery task, or internal message broker) for project manager or estimator triage.
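For documents too large to hold in memory, one option is to spool the chunks to a temporary file instead of a bytearray. The sketch below is an assumption about how that could look, not part of the implementation above: spool_chunks_to_disk is a hypothetical helper that accepts any iterable of byte chunks (for example response.iter_content(chunk_size=8192)) and also returns a SHA-256 digest that a caller could use to detect duplicate revisions.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Iterable, Tuple


def spool_chunks_to_disk(chunks: Iterable[bytes], dest_dir: str) -> Tuple[Path, str]:
    """Write chunks to a temp file in dest_dir; return (path, sha256 hex digest).

    Hypothetical helper: memory use is bounded by the chunk size rather
    than by the size of the document.
    """
    digest = hashlib.sha256()
    fh = tempfile.NamedTemporaryFile(dir=dest_dir, suffix=".part", delete=False)
    path = Path(fh.name)
    try:
        with fh:
            for chunk in chunks:
                if chunk:  # skip empty keep-alive chunks
                    fh.write(chunk)
                    digest.update(chunk)
    except BaseException:
        # Remove the partial file so a retry starts from a clean state.
        path.unlink(missing_ok=True)  # requires Python 3.8+
        raise
    return path, digest.hexdigest()
```

Deleting the partial file on failure matters here because a retried pull would otherwise leave truncated documents next to complete ones.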
Integration & Operational Validation
Deploying retry logic requires alignment with pipeline observability standards. Instrument the tenacity logger to emit structured JSON logs containing attempt_number, wait_duration, and final_status. Track the ratio of 429 vs 5xx responses to adjust tenant-level concurrency limits or request vendor quota increases.
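One way to produce those structured logs is a custom callback passed as before_sleep= in place of before_sleep_log. The sketch below is illustrative: retry_log_record and log_retry_as_json are hypothetical names, and the code assumes the retry state object exposes attempt_number, outcome, and next_action.sleep, as tenacity's retry call state does when the before_sleep hook runs. At that point final_status holds the status of the attempt that just failed.

```python
import json
import logging
from typing import Any

logger = logging.getLogger("construction.doc_ingestion")


def retry_log_record(retry_state: Any) -> str:
    """Build a JSON log line from a tenacity-style retry state (assumed shape)."""
    outcome = retry_state.outcome
    exc = outcome.exception() if outcome is not None else None
    response = getattr(exc, "response", None)
    next_action = retry_state.next_action
    record = {
        "event": "document_pull_retry",
        "attempt_number": retry_state.attempt_number,
        "wait_duration": round(next_action.sleep, 3) if next_action is not None else None,
        # Status of the failed attempt; None for connection errors and timeouts.
        "final_status": getattr(response, "status_code", None),
        "error": type(exc).__name__ if exc is not None else None,
    }
    return json.dumps(record, sort_keys=True)


def log_retry_as_json(retry_state: Any) -> None:
    """Callback intended for use as before_sleep=log_retry_as_json."""
    logger.warning(retry_log_record(retry_state))
```

Keeping the record builder separate from the logging call lets tests assert on the JSON without capturing log output.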
When integrating with async batching workflows, ensure that retry delays do not block the event loop. Wrap synchronous HTTP calls in asyncio.to_thread or migrate to httpx with tenacity’s async-compatible decorators. For change order automation, verify that retry attempts preserve idempotency by using consistent If-Match or If-None-Match headers, preventing duplicate revision ingestion.
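The asyncio.to_thread approach can be sketched as follows. The pull_many function and its max_concurrency parameter are hypothetical; fetch stands for any blocking callable that takes a URL and returns bytes, such as a wrapper around the synchronous function shown earlier. asyncio.to_thread requires Python 3.9 or later.

```python
import asyncio
from typing import Callable, List, Sequence


async def pull_many(
    urls: Sequence[str],
    fetch: Callable[[str], bytes],
    max_concurrency: int = 4,  # assumed default, align with tenant quota
) -> List[bytes]:
    """Run blocking pulls in worker threads without stalling the event loop."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def pull_one(url: str) -> bytes:
        async with semaphore:
            # The blocking call, including any retry sleeps inside it,
            # runs in a thread, so other coroutines keep making progress.
            return await asyncio.to_thread(fetch, url)

    # gather preserves the order of the input URLs in its results.
    return list(await asyncio.gather(*(pull_one(u) for u in urls)))
```

The semaphore bounds how many pulls are in flight at once, which helps keep the batch under the vendor's concurrency quota and reduces the 429 responses that trigger retries in the first place.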
Validate the implementation against simulated vendor outages using tools like toxiproxy or pytest-httpserver. Assert that:
- 429 responses wait for the delay given in Retry-After (capped at 300 seconds) before the next attempt
- 404 and 401 failures bypass retries and route immediately to DLQ
- Connection timeouts respect the (connect, read) tuple and do not hang indefinitely
- Circuit breaker thresholds (if implemented upstream) pause polling before retry queues saturate
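The Retry-After check is easier to assert when the header parsing lives in a pure function that a test can call without a network. The sketch below is one possible shape: parse_retry_after and MAX_DELAY_SECONDS are hypothetical names, and the 300-second cap mirrors the one used in this article. The header may carry either a number of seconds or an HTTP-date, so both forms are handled.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

MAX_DELAY_SECONDS = 300.0  # cap taken from the backoff logic in this article


def parse_retry_after(value: Optional[str], now: Optional[datetime] = None) -> Optional[float]:
    """Return the delay in seconds requested by a Retry-After header.

    Returns None when the header is missing or unparseable, so the caller
    can fall back to exponential backoff.
    """
    if not value:
        return None
    value = value.strip()
    if value.isdecimal():
        # Delta-seconds form, e.g. "120".
        return min(float(value), MAX_DELAY_SECONDS)
    try:
        # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
        target = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if target.tzinfo is None:
        target = target.replace(tzinfo=timezone.utc)
    current = now if now is not None else datetime.now(timezone.utc)
    # Dates in the past mean "retry now"; far-future dates are capped.
    return min(max((target - current).total_seconds(), 0.0), MAX_DELAY_SECONDS)
```

Passing now explicitly keeps the date-based assertions deterministic, which avoids flaky tests around clock boundaries.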
By enforcing strict retry boundaries and routing permanent faults to review queues, construction automation teams eliminate manual reconciliation overhead and maintain accurate, up-to-date document histories across all project phases.