Async Batching Workflows

Construction project tracking and change order automation demand resilient pipelines that ingest high-volume, asynchronous document streams without compromising financial accuracy or schedule integrity. The specific problem this page solves is the document burst: during monthly pay-application windows and competitive bid periods, field teams submit RFIs, submittals, and change order requests faster than synchronous parsers can absorb them, and a single slow OCR pass or oversized attachment stalls the entire request thread. Async batching decouples document receipt from validation, extraction, and downstream routing inside the broader Automated Document Ingestion & Parsing architecture, so that a change order submitted from a job-site trailer on intermittent LTE is buffered, validated, and routed deterministically rather than dropped or double-counted. This page covers how to build that buffering-and-batching layer in Python, how to wire it to the confidence-driven routing states used across this pipeline, and how to verify and troubleshoot it under production load.

Prerequisites

Before implementing the batching layer, confirm the following packages, infrastructure, and upstream assumptions are in place. The async core depends only on the standard library, but production deployments add a distributed broker and the parsing stack shared by the rest of the ingestion pipeline.

Python 3.11+: for asyncio.TaskGroup and exception groups, used in worker supervision.
pydantic>=2.5: all schema contracts use Pydantic v2 (field_validator with @classmethod, model_dump_json).
A broker for horizontal scale: redis>=5 or RabbitMQ when a single process is insufficient; the in-process asyncio.Queue shown here is the development and single-worker path.
pandas>=2.1 and openpyxl: for tabular submittal logs handed to the parsing step (see Batch processing Excel submittal logs with Pandas DataFrames).
pdfplumber / pytesseract: only on the worker tier, for the OCR preprocessing fallback on scanned change orders.

Upstream assumptions: every payload entering the queue already carries a stable content hash and a document-type hint produced by the classification gateway, and the canonical change order contract is governed by the schema validation rules shared across the pipeline. The batching layer never invents schema; it enforces the existing one and defers cost-code semantics to WBS mapping strategies.

Architecture Detail

The batching subsystem has four inputs (field-app submissions, ERP webhooks, inbound email, and resubmitted dead-letter items), a buffering queue, a schema gate, a batch assembler that groups by routing tier, and a worker pool that performs parsing, enrichment, and confidence-driven dispatch. Error branches route invalid payloads to a dead-letter queue and low-confidence extractions to a quarantine state for human review.

The routing decision after extraction is governed by site-canonical confidence thresholds that recur throughout this pipeline: a field-extraction confidence of 0.92 or higher auto-routes; 0.75 up to (but not including) 0.92 diverts to human review; and anything below 0.75 quarantines the record so a document-control specialist can reconcile it before any cost data reaches the ledger.

Schema-First Contract Design

A robust batching system begins with a strict contract before any document enters the queue. Change orders and tracking logs must conform to a validated schema that enforces data types, required fields, and construction-domain patterns. Define canonical fields such as co_id, originating_contractor, cost_delta, schedule_impact_days, approval_tier, cost_code, and attachment_hashes. Cost deltas must be parsed as Decimal to eliminate floating-point drift in cumulative budget tracking, and attachments are referenced by immutable content hashes rather than mutable filenames, preserving audit integrity when revisions circulate across PDF/Excel Sync Pipelines.

Construction-domain constants are encoded as Literal types and regex-validated fields rather than free strings: the cost code follows the MasterFormat XX XX XX six-digit pattern, the WBS element follows PROJ-NNN-DIV-NN, and discipline and status are closed enumerations. State-tracking fields (processing_status, extraction_confidence, validation_errors) make retries idempotent and prevent duplicate billing when a contractor resubmits an identical payload.

from decimal import Decimal
from typing import List, Literal, Optional, Tuple

from pydantic import BaseModel, Field, ValidationError, field_validator

Discipline = Literal["ARCH", "STR", "MEP", "CIV", "ELEC", "PLMB"]
ProcessingStatus = Literal[
    "pending_validation",
    "validated",
    "auto_routed",
    "human_review",
    "quarantined",
    "validation_failed",
    "failed",
]


class ChangeOrderItem(BaseModel):
    co_id: str = Field(..., min_length=5, max_length=20, description="Unique change order id")
    originating_contractor: str = Field(..., min_length=2)
    cost_delta: Decimal = Field(..., ge=0, description="Non-negative cost impact in USD")
    schedule_impact_days: int = Field(default=0, ge=0)
    approval_tier: int = Field(..., ge=1, le=5, description="Routing tier by dollar threshold")
    # MasterFormat section number, e.g. "03 30 00" (cast-in-place concrete)
    cost_code: str = Field(..., pattern=r"^\d{2} \d{2} \d{2}$")
    # WBS element, e.g. "PROJ-014-STR-02"
    wbs_element: str = Field(..., pattern=r"^PROJ-\d{3}-(ARCH|STR|MEP|CIV|ELEC|PLMB)-\d{2}$")
    discipline: Discipline
    attachment_hashes: List[str] = Field(default_factory=list, description="SHA-256 of supporting docs")
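    # Defaults to 1.0, so an unscored payload would auto-route; the parsing
    # step must always set this explicitly.
    extraction_confidence: float = Field(default=1.0, ge=0.0, le=1.0)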
    processing_status: ProcessingStatus = "pending_validation"
    validation_errors: List[str] = Field(default_factory=list)

    @field_validator("cost_delta", mode="before")
    @classmethod
    def coerce_decimal(cls, value: object) -> Decimal:
        # Mixed separators are ambiguous: "45.000,50" (European) and
        # "45,000.50" (US) both mean 45000.50. Whichever separator appears
        # last is the decimal point; the other is a thousands separator.
        if isinstance(value, str) and "," in value and "." in value:
            if value.rfind(",") > value.rfind("."):
                value = value.replace(".", "").replace(",", ".")
            else:
                value = value.replace(",", "")
        return Decimal(str(value))

    @classmethod
    def validate_payload(
        cls, raw: dict
    ) -> Tuple[bool, Optional["ChangeOrderItem"], List[str]]:
        try:
            return True, cls(**raw), []
        except ValidationError as exc:
            errors = [f"{err['loc']}: {err['msg']}" for err in exc.errors()]
            return False, None, errors

Step-by-Step Implementation

The batch processor follows a strict lifecycle. Each step below carries a construction-specific rationale, because the failure cost here is misrouted money, not a dropped log line.

Ingest and buffer. Raw payloads land in an asyncio.Queue (or a broker topic for multi-worker fan-out). Buffering is what lets a job-site burst of fifty change orders arrive without back-pressuring the field app to a timeout.
Drain in bounded windows. The worker pulls up to batch_size records per cycle (typically 50–200). Bounded windows keep a single oversized scanned submittal from monopolizing memory and smooth traffic spikes during end-of-month pay applications.
Apply the schema gate. Every payload is validated against the contract. Invalid items are routed to a dead-letter queue with structured error payloads rather than dropped, so a contractor’s malformed resubmission is recoverable, not lost.
Assemble batches by routing tier. Valid items are grouped by approval_tier and cost_code so downstream dispatch can fan a single tier-4 batch to executive approvers in one call.
Parse, enrich, and score. Worker coroutines extract line items and attach an extraction_confidence. The confidence drives the route: auto-route at ≥0.92, human review from 0.75 up to (but not including) 0.92, quarantine below 0.75.
Dispatch deterministically. Auto-routed batches publish to the ERP and approver topics; everything else is parked in its review or quarantine state with the raw evidence preserved.

Idempotency is the load-bearing decision throughout: because retries are inevitable on flaky site connectivity, every record is keyed by its content hash and co_id, so replaying a batch never produces a second ledger entry. The following module is runnable and demonstrates the full lifecycle with structured logging for audit trails.

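import asyncio
import logging
from decimal import Decimal, InvalidOperation
from typing import Any, Dict, List, Optional, Tuple

# ChangeOrderItem is the schema model defined above; keep it in this module
# or import it here before running.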

logging.basicConfig(level=logging.INFO, format="%(asctime)s | %(levelname)s | %(message)s")
logger = logging.getLogger(__name__)

AUTO_ROUTE_THRESHOLD = 0.92
QUARANTINE_THRESHOLD = 0.75


class AsyncBatchProcessor:
    def __init__(self, batch_size: int = 50, max_retries: int = 3) -> None:
        self.queue: "asyncio.Queue[Dict[str, Any]]" = asyncio.Queue()
        self.batch_size = batch_size
        self.max_retries = max_retries
        self.processed_count = 0
        self._seen_hashes: set[str] = set()  # idempotency guard

    async def ingest(self, payload: Dict[str, Any]) -> None:
        """Buffer a raw payload; safe to call concurrently from many producers."""
        await self.queue.put(payload)

    def _validate(
        self, raw: Dict[str, Any]
    ) -> Tuple[bool, Optional[ChangeOrderItem], List[str]]:
        try:
            return ChangeOrderItem.validate_payload(raw)
        except (InvalidOperation, TypeError) as exc:
            return False, None, [f"Type coercion failed: {exc}"]

    def _route_by_confidence(self, item: ChangeOrderItem) -> str:
        """Map the site-canonical confidence thresholds to a routing state."""
        if item.extraction_confidence >= AUTO_ROUTE_THRESHOLD:
            return "auto_routed"
        if item.extraction_confidence >= QUARANTINE_THRESHOLD:
            return "human_review"
        return "quarantined"

    async def _process_batch(self, batch: List[ChangeOrderItem]) -> None:
        logger.info("Processing batch of %d validated records", len(batch))
        for item in batch:
            # Idempotency: skip records already routed in a prior retry.
            fingerprint = f"{item.co_id}:{''.join(sorted(item.attachment_hashes))}"
            if fingerprint in self._seen_hashes:
                logger.info("Skipping duplicate %s (idempotent retry)", item.co_id)
                continue
            try:
                await asyncio.sleep(0.01)  # stand-in for OCR / field extraction
                item.processing_status = self._route_by_confidence(item)  # type: ignore[assignment]
                self._seen_hashes.add(fingerprint)
                self.processed_count += 1
                logger.info(
                    "CO %s | tier %d | %s | delta $%s | conf %.2f",
                    item.co_id, item.approval_tier, item.processing_status,
                    item.cost_delta, item.extraction_confidence,
                )
            except Exception as exc:  # noqa: BLE001 - record, never crash the worker
                logger.error("Runtime failure on CO %s: %s", item.co_id, exc)
                item.validation_errors.append(f"Processing error: {exc}")
                item.processing_status = "failed"

    async def _worker_loop(self) -> None:
        while True:
            valid: List[ChangeOrderItem] = []
            invalid: List[Dict[str, Any]] = []

            for _ in range(self.batch_size):
                try:
                    raw = self.queue.get_nowait()
                except asyncio.QueueEmpty:
                    break
                ok, instance, errors = self._validate(raw)
                if ok and instance:
                    valid.append(instance)
                else:
                    raw["validation_errors"] = errors
                    raw["processing_status"] = "validation_failed"
                    invalid.append(raw)

            if not valid and not invalid:
                await asyncio.sleep(0.5)
                continue

            if invalid:
                logger.warning("Routing %d invalid payloads to dead-letter queue", len(invalid))
                # production: publish to DLQ topic or write to an error table

            if valid:
                # Assemble by routing tier before dispatch.
                valid.sort(key=lambda i: (i.approval_tier, i.cost_code))
                await self._process_batch(valid)

    async def run(self) -> None:
        logger.info("Async batch worker started")
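        try:
            await self._worker_loop()
        except asyncio.CancelledError:
            # Nothing is flushed here: items still in the in-process queue are
            # lost on shutdown, so report how many remain before re-raising.
            logger.info("Worker cancelled; %d items still queued", self.queue.qsize())
            raise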


async def main() -> None:
    processor = AsyncBatchProcessor(batch_size=3)
    payloads = [
        {"co_id": "CO-2024-001", "originating_contractor": "Apex Concrete", "cost_delta": 12500.50,
         "approval_tier": 2, "cost_code": "03 30 00", "wbs_element": "PROJ-014-STR-02",
         "discipline": "STR", "attachment_hashes": ["a1b2c3"], "extraction_confidence": 0.97},
        {"co_id": "CO-2024-002", "originating_contractor": "SteelWorks LLC", "cost_delta": "45000",
         "approval_tier": 4, "cost_code": "05 12 00", "wbs_element": "PROJ-014-STR-07",
         "discipline": "STR", "attachment_hashes": ["d4e5f6"], "extraction_confidence": 0.81},
        {"co_id": "BAD-FORMAT", "originating_contractor": "Unknown", "cost_delta": "not_a_number",
         "approval_tier": 1, "cost_code": "00 00 00", "wbs_element": "INVALID", "discipline": "STR"},
        {"co_id": "CO-2024-003", "originating_contractor": "MEP Solutions", "cost_delta": 8750.00,
         "approval_tier": 1, "cost_code": "23 00 00", "wbs_element": "PROJ-014-MEP-01",
         "discipline": "MEP", "attachment_hashes": ["g7h8i9"], "extraction_confidence": 0.62},
    ]

    await asyncio.gather(*(processor.ingest(p) for p in payloads))

    task = asyncio.create_task(processor.run())
    await asyncio.sleep(1.5)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass

    logger.info("Pipeline complete. Processed %d records", processor.processed_count)


if __name__ == "__main__":
    asyncio.run(main())
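
The worker loop above sorts valid items by (approval_tier, cost_code) but still passes them to _process_batch as a single list. If downstream dispatch needs one call per tier, as step 4 describes, the sorted list can be split with itertools.groupby. The following is a minimal sketch rather than part of the module above: Item is a hypothetical stand-in for ChangeOrderItem that carries only the grouping fields, and group_by_tier is an illustrative helper name.

```python
from dataclasses import dataclass
from itertools import groupby
from operator import attrgetter
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Item:
    """Hypothetical stand-in for ChangeOrderItem (grouping fields only)."""
    co_id: str
    approval_tier: int
    cost_code: str


def group_by_tier(items: Iterable[Item]) -> Dict[Tuple[int, str], List[Item]]:
    """Return {(approval_tier, cost_code): [items]} in ascending key order."""
    key = attrgetter("approval_tier", "cost_code")
    # groupby only merges adjacent equal keys, so sort by the same key first.
    ordered = sorted(items, key=key)
    return {k: list(group) for k, group in groupby(ordered, key=key)}


items = [
    Item("CO-2024-002", 4, "05 12 00"),
    Item("CO-2024-001", 2, "03 30 00"),
    Item("CO-2024-004", 4, "05 12 00"),
]
batches = group_by_tier(items)
for (tier, code), group in batches.items():
    print(tier, code, [i.co_id for i in group])
```

Because sorted is stable, items inside each group keep their arrival order, which keeps dispatch deterministic across retries.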

Deterministic Parsing and Field Normalization

Inside each worker, field extraction combines structured table parsing, regex pattern matching, and an OCR fallback for scanned site reports and legacy contractor forms. The field extraction techniques used here isolate line-item descriptions, unit rates, labor multipliers, and overhead percentages before any calculation logic runs, and they attach the extraction_confidence that the routing step consumes. The parsing layer stays stateless and idempotent: when a document mixes formats — a PDF with an embedded Excel table — the pipeline extracts raw text, applies layout-aware segmentation, and maps values onto the validated schema. Ambiguous or missing fields lower the confidence rather than defaulting to zero, which is what keeps a half-legible scanned tier-4 change order out of the auto-route path and in front of a human.

Schema and Configuration Reference

The table below defines the fields and configuration keys that govern this subsystem. The cost code and WBS element are the two construction-domain anchors that downstream cost allocation depends on.

| Field / key | Type | Rule | Construction rationale |
| --- | --- | --- | --- |
| `co_id` | `str` | length 5–20 | Stable key for idempotent retries |
| `cost_delta` | `Decimal` | `>= 0` | No float drift in cumulative budget totals |
| `approval_tier` | `int` | `1`–`5` | Dollar-threshold routing to approvers |
| `cost_code` | `str` | `^\d{2} \d{2} \d{2}$` | MasterFormat section number (e.g. `03 30 00`) |
| `wbs_element` | `str` | `PROJ-NNN-DIV-NN` | Ties cost to work-breakdown node |
| `discipline` | `Literal` | ARCH/STR/MEP/CIV/ELEC/PLMB | Closed enum prevents free-text drift |
| `extraction_confidence` | `float` | `0.0`–`1.0` | Drives the routing decision |
| `processing_status` | `Literal` | state enum | Idempotent state machine |
| `batch_size` | config | `50`–`200` | Window size; caps per-cycle memory |
| `AUTO_ROUTE_THRESHOLD` | config | `0.92` | Auto-route at or above |
| `QUARANTINE_THRESHOLD` | config | `0.75` | Below this, quarantine |

Downstream Routing and Integration Points

Validated, scored batches transition to routing logic aligned with construction financial workflows. Change orders exceeding configured thresholds route to senior project managers or executive approvers based on approval_tier; cost deltas aggregate against active project budgets; and schedule impacts are cross-referenced with baseline Gantt exports. Integration points expose webhook endpoints or message-bus topics for ERP synchronization, so Procore, Autodesk Build, or a custom accounting system receives deterministic payloads. Transient network failures trigger exponential backoff, while structural validation failures stop processing and notify the originating estimator through the same channels described in the pipeline’s error handling protocols.
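
The split between retryable and non-retryable failures can be sketched as a small asyncio helper. This is an illustration under stated assumptions, not the pipeline's actual dispatch code: TransientError, with_backoff, and flaky_erp_post are hypothetical names, and the delay values are placeholders. Only TransientError is retried; any other exception, such as a structural validation failure, propagates on the first attempt.

```python
import asyncio
import random
from typing import Awaitable, Callable, Tuple, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeouts, 503s)."""


async def with_backoff(
    call: Callable[[], Awaitable[T]],
    max_retries: int = 3,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
) -> T:
    """Retry `call` on TransientError, doubling the delay ceiling each time."""
    for attempt in range(max_retries + 1):
        try:
            return await call()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted; the caller can dead-letter the batch
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter keeps many workers from retrying in lockstep.
            await asyncio.sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")


async def demo() -> Tuple[str, int]:
    attempts = 0

    async def flaky_erp_post() -> str:  # hypothetical ERP call
        nonlocal attempts
        attempts += 1
        if attempts < 3:
            raise TransientError("ERP gateway timeout")
        return "accepted"

    result = await with_backoff(flaky_erp_post, base_delay=0.01)
    return result, attempts


result, attempts = asyncio.run(demo())
print(result, attempts)
```

Retrying is only safe here because dispatch is idempotent: a replayed batch is suppressed by the fingerprint check instead of producing a second ledger entry.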

Verification and Testing

Confirm correct behavior with assertions that exercise the schema gate, the idempotency guard, and the confidence router. The example payloads above are designed so that exactly one record auto-routes, one lands in human review, one is quarantined, and one fails the schema gate.

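import asyncio

# Module name taken from the run command in this section; adjust if the
# schema and processor live elsewhere.
from batch_processor import AsyncBatchProcessor, ChangeOrderItem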


def test_confidence_router() -> None:
    proc = AsyncBatchProcessor()
    base = dict(
        co_id="CO-TEST-1", originating_contractor="Apex", cost_delta="100",
        approval_tier=1, cost_code="03 30 00", wbs_element="PROJ-001-STR-01",
        discipline="STR",
    )
    high = ChangeOrderItem(**{**base, "extraction_confidence": 0.95})
    mid = ChangeOrderItem(**{**base, "extraction_confidence": 0.80})
    low = ChangeOrderItem(**{**base, "extraction_confidence": 0.50})
    assert proc._route_by_confidence(high) == "auto_routed"
    assert proc._route_by_confidence(mid) == "human_review"
    assert proc._route_by_confidence(low) == "quarantined"


def test_schema_gate_rejects_bad_cost_code() -> None:
    ok, _, errors = ChangeOrderItem.validate_payload(
        {"co_id": "CO-TEST-2", "originating_contractor": "Apex", "cost_delta": "100",
         "approval_tier": 1, "cost_code": "033-000", "wbs_element": "PROJ-001-STR-01",
         "discipline": "STR"}
    )
    assert ok is False and any("cost_code" in e for e in errors)


def test_idempotent_replay() -> None:
    async def run() -> int:
        proc = AsyncBatchProcessor(batch_size=2)
        payload = {"co_id": "CO-DUP", "originating_contractor": "Apex", "cost_delta": "100",
                   "approval_tier": 1, "cost_code": "03 30 00", "wbs_element": "PROJ-001-STR-01",
                   "discipline": "STR", "attachment_hashes": ["x1"], "extraction_confidence": 0.99}
        await proc.ingest(payload)
        await proc.ingest(dict(payload))  # same fingerprint, replayed
        task = asyncio.create_task(proc.run())
        await asyncio.sleep(0.3)
        task.cancel()
        try:
            await task
        except asyncio.CancelledError:
            pass
        return proc.processed_count

    assert asyncio.run(run()) == 1  # duplicate suppressed

Run the module directly (python batch_processor.py) and watch the structured log: you should see one auto_routed, one human_review, one quarantined, and one dead-letter warning for BAD-FORMAT. Use ChangeOrderItem(...).model_dump_json() to snapshot a validated record for golden-file comparisons in CI.

Troubleshooting

All records land in quarantine. Root cause: the parsing step is emitting a default extraction_confidence of 0.0 because the OCR fallback never ran. Fix: confirm pytesseract resolves on the worker tier and that scanned documents are actually reaching the OCR preprocessing branch, not silently failing earlier.
Pydantic rejects valid European change orders. Root cause: amounts like 45.000,50 reach Decimal before separator normalization. Fix: the coerce_decimal validator above swaps the separators; ensure no upstream step casts to float first, which would already have corrupted the value.
Duplicate ledger entries after a retry. Root cause: the broker redelivered a message and the idempotency fingerprint was not persisted. Fix: store _seen_hashes in Redis or the database, not in process memory, so it survives a worker restart mid-batch.
Queue depth grows without bound during bid week. Root cause: a single worker, or batch_size too small for the arrival rate. Fix: scale workers horizontally against a shared broker and raise batch_size toward 200; monitor queue depth as a first-class metric.
Tier-4 change orders auto-route without human sign-off. Root cause: the confidence threshold is applied uniformly, although high approval tiers should force review regardless of score. Fix: gate auto_routed on both extraction_confidence >= 0.92 and approval_tier < 4, so high-value orders go to review even when extraction is clean.
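
The tier gate from the last item can be sketched as a standalone function. It reuses the two thresholds from the processor module; FORCE_REVIEW_TIER and route are hypothetical names introduced for illustration, and the cutoff of 4 comes from the fix described above.

```python
AUTO_ROUTE_THRESHOLD = 0.92
QUARANTINE_THRESHOLD = 0.75
FORCE_REVIEW_TIER = 4  # hypothetical name; tiers at or above this never auto-route


def route(extraction_confidence: float, approval_tier: int) -> str:
    """Confidence routing with a tier gate on the auto-route path."""
    if extraction_confidence < QUARANTINE_THRESHOLD:
        return "quarantined"
    if extraction_confidence >= AUTO_ROUTE_THRESHOLD and approval_tier < FORCE_REVIEW_TIER:
        return "auto_routed"
    return "human_review"


print(route(0.97, 2))  # auto_routed
print(route(0.97, 4))  # human_review: clean extraction, but tier forces sign-off
print(route(0.81, 4))  # human_review
print(route(0.62, 4))  # quarantined
```

Quarantine is checked first so that a low-confidence tier-4 record still goes to document control and not straight to an approver.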

Operational Considerations

Production batching needs rigorous monitoring and durable idempotency. Track queue depth, validation failure rate, and average batch latency as structured metrics, and use content-addressable storage for attachments so identical PDFs and spreadsheets are never parsed twice. Wrap downstream ERP calls in circuit breakers to prevent cascade failures during peak pay-application windows. For authoritative guidance on async concurrency and queue management, consult the official asyncio Queue documentation and the Pydantic v2 documentation. By enforcing strict schema contracts, isolating parsing logic, and routing scored batches deterministically, construction automation teams keep financial accuracy intact while scaling document throughput across distributed project sites.
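
As an illustration of the circuit-breaker idea, the sketch below wraps an awaitable downstream call and stops calling it after repeated failures. It is a simplified, single-task version under stated assumptions: CircuitBreaker, CircuitOpenError, erp_down, and erp_up are hypothetical names, the thresholds are placeholders, and a production deployment would more likely use an established library.

```python
import asyncio
import time
from typing import Awaitable, Callable, List, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the downstream call is skipped."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    async def call(self, fn: Callable[[], Awaitable[T]]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; downstream call skipped")
            # Cool-down elapsed: half-open, so one more failure re-opens it.
            self._opened_at = None
            self._failures = self.failure_threshold - 1
        try:
            result = await fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        return result


async def demo() -> List[str]:
    now = [0.0]  # fake clock so the demo never sleeps
    breaker = CircuitBreaker(failure_threshold=2, reset_after=10.0, clock=lambda: now[0])

    async def erp_down() -> str:  # hypothetical failing ERP call
        raise ConnectionError("ERP unreachable")

    async def erp_up() -> str:  # hypothetical healthy ERP call
        return "accepted"

    outcomes: List[str] = []
    for _ in range(3):
        try:
            await breaker.call(erp_down)
        except CircuitOpenError:
            outcomes.append("skipped")
        except ConnectionError:
            outcomes.append("failed")
    now[0] = 11.0  # cool-down has passed
    outcomes.append(await breaker.call(erp_up))
    return outcomes


outcomes = asyncio.run(demo())
print(outcomes)
```

While the breaker is open, batches can stay parked in their current state and be retried later, which keeps a slow ERP from tying up every worker during a pay-application window.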

Frequently Asked Questions

When should I use asyncio versus Celery for document bursts?

For a single worker process handling moderate bursts, an in-process asyncio.Queue is simpler to reason about and has no broker dependency. Reach for Celery or a Redis/RabbitMQ-backed queue once you need horizontal scale across machines, durable retries that survive a crash, or visibility into queue depth from outside the process. The schema gate and confidence-routing logic stay identical; only the transport changes.

How large should each batch window be?

Start at 50 records per cycle and tune upward toward 200 based on the size of attachments and the latency of your OCR step. Larger windows improve throughput and downstream call batching but raise per-cycle memory and the blast radius of a single failed batch. Treat batch_size as a configuration key, not a constant, so it can be adjusted during bid week without a redeploy.
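
One way to treat batch_size as configuration is to read it from the environment at worker start-up. The sketch below assumes a hypothetical BATCH_SIZE variable and clamps it to an upper bound of 200, matching the range above; the lower bound of 1 is an assumption that keeps small test values usable. A worker restart picks up the new value without a code change.

```python
import os
from typing import Mapping

DEFAULT_BATCH_SIZE = 50
MIN_BATCH_SIZE, MAX_BATCH_SIZE = 1, 200


def load_batch_size(env: Mapping[str, str] = os.environ) -> int:
    """Read BATCH_SIZE (hypothetical variable name) and clamp it to a safe range."""
    raw = env.get("BATCH_SIZE")
    if raw is None:
        return DEFAULT_BATCH_SIZE
    try:
        value = int(raw)
    except ValueError:
        # Ignore an unparseable value rather than crash the worker at start-up.
        return DEFAULT_BATCH_SIZE
    return max(MIN_BATCH_SIZE, min(MAX_BATCH_SIZE, value))


print(load_batch_size({}))                      # 50
print(load_batch_size({"BATCH_SIZE": "150"}))   # 150
print(load_batch_size({"BATCH_SIZE": "5000"}))  # 200
print(load_batch_size({"BATCH_SIZE": "lots"}))  # 50
```

The result would then be passed to the constructor, for example AsyncBatchProcessor(batch_size=load_batch_size()).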

How do the confidence thresholds map to routing states?

Extraction confidence of 0.92 or higher auto-routes the change order to the ERP and approvers; 0.75 up to (but not including) 0.92 diverts it to a human-review queue; and anything below 0.75 is quarantined for a document-control specialist. These thresholds are consistent across the ingestion pipeline so that a record’s fate is predictable regardless of which subsystem scored it.

How does batching guarantee a contractor’s resubmission is not billed twice?

Each record is fingerprinted by its co_id and the sorted set of attachment content hashes. Before routing, the worker checks that fingerprint against a persisted seen-set; a replayed or resubmitted payload with the same fingerprint is skipped. Persist the seen-set in Redis or the database rather than process memory so the guard survives worker restarts.
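
A persisted seen-set can be sketched with the standard library's sqlite3 module standing in for Redis or the main database. SeenSet, claim, and the table layout are hypothetical; the fingerprint recipe mirrors the one in the worker. With Redis, the same check-and-record step is commonly done with a single SET using the NX option.

```python
import sqlite3
from typing import Iterable


class SeenSet:
    """Durable idempotency guard backed by SQLite (stand-in for Redis or a database)."""

    def __init__(self, path: str = ":memory:") -> None:
        # Pass a file path in practice; ":memory:" is only for this demo.
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS seen (fingerprint TEXT PRIMARY KEY)"
        )
        self._conn.commit()

    @staticmethod
    def fingerprint(co_id: str, attachment_hashes: Iterable[str]) -> str:
        # Same recipe as the worker: co_id plus the sorted attachment hashes.
        return f"{co_id}:{''.join(sorted(attachment_hashes))}"

    def claim(self, fingerprint: str) -> bool:
        """Record the fingerprint; return False if it was already present."""
        cur = self._conn.execute(
            "INSERT OR IGNORE INTO seen (fingerprint) VALUES (?)", (fingerprint,)
        )
        self._conn.commit()
        return cur.rowcount == 1


seen = SeenSet()
first = seen.claim(SeenSet.fingerprint("CO-DUP", ["x2", "x1"]))
replay = seen.claim(SeenSet.fingerprint("CO-DUP", ["x1", "x2"]))  # same hashes, reordered
print(first, replay)  # True False
```

Doing the check and the insert in one statement avoids a race between two workers that receive the same redelivered message. Note that this sketch claims the fingerprint before routing, whereas the worker shown earlier records it after routing succeeds; a real implementation would need to release the claim if routing fails so the retry is not suppressed.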
