PDF/Excel Sync Pipelines

Construction cost tracking lives in two formats that refuse to agree. The contractual truth — a change order, a directive, an executed amendment — arrives as a PDF, often scanned, signed, and stamped. The financial model — the running budget, committed costs, and contingency draw-down — lives in an Excel workbook that an estimator edits daily. The specific sub-problem this page solves is keeping those two artifacts in sync without a human re-keying numbers between them: extracting line-item adjustments from a PDF change order, matching each one to the correct row in the budget workbook, and committing a reconciled delta to the cost ledger exactly once. When that reconciliation is manual it drifts — a transposed decimal, a change order applied to the wrong cost code, a duplicate entry from a re-sent email — and the drift surfaces months later as a budget variance nobody can explain. This page is the reconciliation subsystem of the Automated Document Ingestion & Parsing pipeline; it assumes the upstream gateway has already classified each document and that field values arrive with a confidence score attached.

Prerequisites

This subsystem sits downstream of extraction and upstream of the ledger, so it depends on the rest of the stack rather than reinventing it:

Python 3.11+ with pydantic v2 for typed validation, openpyxl for native .xlsx reading, pypdf for native PDF text streams, and the standard decimal module — financial values are never float.
A canonical change-order contract. The reconciliation engine imports the schema rather than defining its own; the authoritative model and its validation rules live in schema validation rules.
Confidence-scored fields. Every extracted value carries a score so the pipeline can route it. The three site-canonical bands apply identically here: 0.92 and above auto-routes to the ledger, 0.75–0.92 holds for human review, and below 0.75 is quarantined. Those scores are produced by field extraction techniques, and scanned PDFs are first run through OCR preprocessing before they reach this stage.
A task queue — Celery with a Redis or RabbitMQ broker — so high-volume reconciliation runs out of band. During bid periods and monthly pay cycles a job site can submit hundreds of change orders in a burst; the queue mechanics are detailed in async batching workflows.
A cost-code taxonomy. Every adjustment must land on a real budget line. Documents are bound to a work breakdown structure code at the gateway, and the budget workbook’s codes are normalized per budget code standardization so a PDF reference and an Excel row can actually be matched.

Architecture: inputs, stages, and error branches

Reconciliation is not a single function call. It is an ordered pipeline in which each stage can fail independently, and the job is to make every failure produce a structured, replayable outcome rather than a silent default. The stage table below traces a change-order payload from a raw PDF and a baseline workbook to either a committed ledger delta or a parked exception.

The error branches map to distinct dispositions: a structurally unsound payload is rejected at the schema boundary, an adjustment that cannot be tied to a budget line is quarantined as an orphan rather than guessed at, and a delta whose idempotency key has already been seen is skipped so a re-sent document cannot double-count.

Stage	Input	Output	Error branch
PDF extraction	Raw PDF bytes	Line items + confidence	No text layer → OCR; OCR fail → quarantine
Budget mapping	`.xlsx` workbook	`{cost_code: baseline}`	Missing sheet/columns → reject
Schema validation	Extracted record	Typed `ChangeOrderRecord`	Type/constraint fail → `SCHEMA_*` reject
Cost-code match	Record + baseline map	Matched cost code	No match → `ORPHAN_ADJUSTMENT` quarantine
Delta commit	Matched record	Ledger row	Duplicate key → skip (idempotent)

Step-by-step implementation

The module below is deterministic, strictly typed, and structured so each stage’s failure is observable. Build it up in four steps.

Step 1 — Define the canonical reconciled record

The record is the contract between the PDF and the ledger. Financial fields use Decimal, never float, because binary floating point silently corrupts cost aggregations — summing a few hundred float adjustments will leave you cents off, and on a nine-figure project that becomes a real variance. Construction-domain constants are constrained rather than free strings: the cost code follows the MasterFormat XX XX XX division pattern, the WBS element follows the site PROJ-NNN-DIV-NN convention, and the approval status is a Literal. An idempotency_key rides on every record so the same change order arriving twice (a re-sent email, a queue retry) commits exactly once.

import logging
from decimal import Decimal
from typing import Literal
from pydantic import BaseModel, Field, field_validator, model_validator

logging.basicConfig(level=logging.INFO, format="%(asctime)s | %(levelname)s | %(message)s")
logger = logging.getLogger("pdf_excel_sync")

ApprovalStatus = Literal["PENDING", "UNDER_REVIEW", "APPROVED", "REJECTED", "EXECUTED"]

class ChangeOrderRecord(BaseModel):
    """Canonical schema for a reconciled change-order adjustment."""
    change_order_id: str = Field(pattern=r"^CO-\d{4}-\d{3}$")
    # MasterFormat division code, XX XX XX pattern (e.g. "03 30 00")
    cost_code: str = Field(pattern=r"^\d{2} \d{2} \d{2}$")
    # WBS element pattern: project-sequence-division-subdivision
    wbs_code: str = Field(pattern=r"^[A-Z]{2,5}-\d{3}-\d{2}-\d{2}$")
    original_contract_value: Decimal = Field(ge=0)
    proposed_adjustment: Decimal
    impact_days: int
    approval_status: ApprovalStatus
    # Confidence rides on the record so reconciliation can route it.
    confidence: float = Field(ge=0.0, le=1.0)

    @property
    def idempotency_key(self) -> str:
        # Stable across re-sends: same CO + cost code + adjustment = one commit.
        # normalize() drops trailing zeros so 12500 and 12500.00 yield the same key.
        return f"{self.change_order_id}:{self.cost_code}:{self.proposed_adjustment.normalize():f}"

    @field_validator("proposed_adjustment")
    @classmethod
    def within_contingency(cls, v: Decimal) -> Decimal:
        # Adjustments above the contingency cap are rejected at the schema
        # boundary so they can never auto-route; they need secondary review.
        if abs(v) > Decimal("500000"):
            raise ValueError("Adjustment exceeds contingency threshold; secondary review required.")
        return v

    @model_validator(mode="after")
    def revised_value_non_negative(self) -> "ChangeOrderRecord":
        if self.original_contract_value + self.proposed_adjustment < 0:
            raise ValueError("Adjustment would drive the contract value below zero.")
        return self

Step 2 — Extract the PDF and map the budget workbook

Each source format has its own failure mode. A scanned change order has no text layer, so the extractor must detect that and route to OCR rather than returning an empty string that later parses to zero. The Excel side is fragile in a different way: estimators insert and delete columns mid-project, so the reader resolves columns by header name, never by fixed index — relying on row[3] is how a budget map silently shifts onto the wrong column after someone adds a “notes” field.

from pathlib import Path
import openpyxl
import pypdf

# Site-canonical routing bands — identical across every subsystem.
AUTO_ROUTE = 0.92
HUMAN_REVIEW = 0.75

def extract_pdf_text(pdf_path: Path) -> str:
    """Return the native text layer, or signal that OCR is required."""
    reader = pypdf.PdfReader(str(pdf_path))
    pages = [p.extract_text() or "" for p in reader.pages]
    text = "\n".join(pages).strip()
    if not text:
        # Never return "" downstream — that parses to zero. Route to OCR.
        raise RuntimeError("PDF has no extractable text layer; route to OCR preprocessing.")
    return text

def map_budget_by_header(xlsx_path: Path) -> dict[str, Decimal]:
    """Map cost code -> baseline value, resolving columns by HEADER not index."""
    wb = openpyxl.load_workbook(str(xlsx_path), data_only=True)
    ws = wb.active
    if ws is None:
        raise ValueError("No active worksheet in budget workbook.")

    header = {str(c.value).strip().lower(): i for i, c in enumerate(ws[1]) if c.value}
    try:
        code_col = header["cost code"]
        value_col = header["baseline value"]
    except KeyError as exc:
        raise ValueError(f"Budget workbook missing required column: {exc}") from exc

    budget: dict[str, Decimal] = {}
    for row in ws.iter_rows(min_row=2, values_only=True):
        code, value = row[code_col], row[value_col]
        if code and isinstance(value, (int, float)):
            # Cast through str so Decimal does not inherit float error.
            budget[str(code).strip()] = Decimal(str(value))
    return budget

Step 3 — Match cost codes, including the renamed-row case

The hard part of reconciliation is that a PDF change order references a cost code that may not match the workbook verbatim. A row gets renamed, a division code is written 03-30-00 in the PDF and 03 30 00 in the sheet, or a subcontractor uses a legacy code. Exact matching alone produces a flood of false orphans. The matcher therefore tries an exact lookup first, then a separator-normalized fuzzy match against the workbook's MasterFormat codes, and a fuzzy match only applies when its similarity clears the 0.92 band. Anything weaker is quarantined as an ORPHAN_ADJUSTMENT rather than guessed, because a confidently wrong match silently moves money to the wrong account.

from difflib import SequenceMatcher

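def normalize_code(code: str) -> str:
    """Collapse separators so '03-30-00' and '03 30 00' compare equal."""
    # Drop any '#'-suffixed annotation, unify separators, collapse repeated spaces.
    base = code.split("#")[0]
    return " ".join(base.replace("-", " ").replace(".", " ").split())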

def match_cost_code(record_code: str, budget: dict[str, Decimal]) -> tuple[str | None, float]:
    """Return (matched_code, score). Exact = 1.0; fuzzy must clear 0.92 to apply."""
    if record_code in budget:
        return record_code, 1.0

    target = normalize_code(record_code)
    best_code: str | None = None
    best_score = 0.0
    for candidate in budget:
        score = SequenceMatcher(None, target, normalize_code(candidate)).ratio()
        if score > best_score:
            best_code, best_score = candidate, score

    return (best_code, best_score) if best_score >= AUTO_ROUTE else (None, best_score)
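
To see what the 0.92 band accepts in practice, it helps to compute a couple of ratios by hand. The snippet below is a standalone sketch: `normalize` and `similarity` are hypothetical helpers that mirror the separator handling described above, not the module's own functions.

```python
from difflib import SequenceMatcher


def normalize(code: str) -> str:
    """Hypothetical helper: unify '-' and '.' separators into single spaces."""
    return " ".join(code.replace("-", " ").replace(".", " ").split())


def similarity(a: str, b: str) -> float:
    """Similarity ratio of two cost codes after separator normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


separator_only = similarity("03-30-00", "03 30 00")  # same code, different separators
one_digit_off = similarity("03 30 00", "03 31 00")   # a different budget line

print(separator_only, one_digit_off)
```

For eight-character codes, a single substituted digit leaves at most seven of eight characters in common, so the ratio cannot exceed 2 × 7 / 16 = 0.875. Under these assumptions the threshold accepts separator differences and rejects neighbouring codes, which is the conservative behaviour the matcher aims for.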

Step 4 — Reconcile idempotently and route by confidence

The final stage computes the delta and commits it — exactly once. Idempotency is not optional in a construction pipeline: change orders are re-sent, queues retry, and a network blip during a monthly close can replay an entire batch. Keying the commit on the record’s idempotency_key means a duplicate is a no-op, not a double charge. Routing keys off the field confidence and the match score together: both must clear 0.92 to auto-commit, the 0.75–0.92 band goes to review, and anything weaker — or any unmatched code — is quarantined for a document-control specialist. The detailed exception-routing and dead-letter behavior is owned by error handling protocols; this engine just emits typed outcomes for it to act on.

from typing import Any

class ReconciliationEngine:
    """Deterministic, idempotent reconciliation of change orders against a budget."""

    def __init__(self, budget: dict[str, Decimal]) -> None:
        self.budget = budget
        self._committed: set[str] = set()  # idempotency ledger (DB-backed in prod)

    def reconcile(self, record: ChangeOrderRecord) -> dict[str, Any]:
        if record.idempotency_key in self._committed:
            logger.info("Skip duplicate %s", record.idempotency_key)
            return {"status": "SKIPPED_DUPLICATE", "key": record.idempotency_key}

        matched_code, match_score = match_cost_code(record.cost_code, self.budget)
        if matched_code is None:
            logger.warning("Orphan adjustment %s (best score %.2f)", record.cost_code, match_score)
            return {"status": "QUARANTINE", "error_code": "ORPHAN_ADJUSTMENT",
                    "cost_code": record.cost_code, "match_score": match_score}

        routing_score = min(record.confidence, match_score)
        baseline = self.budget[matched_code]
        delta = {
            "cost_code": matched_code,
            "baseline": baseline,
            "adjustment": record.proposed_adjustment,
            "reconciled_total": baseline + record.proposed_adjustment,
            "impact_days": record.impact_days,
        }

        if routing_score >= AUTO_ROUTE:
            self._committed.add(record.idempotency_key)
            logger.info("Committed %s -> %s", record.change_order_id, matched_code)
            return {"status": "COMMITTED", "delta": delta}
        if routing_score >= HUMAN_REVIEW:
            return {"status": "HUMAN_REVIEW", "error_code": "LOW_CONFIDENCE_MATCH", "delta": delta}
        return {"status": "QUARANTINE", "error_code": "EXTRACTION_LOW_CONFIDENCE", "delta": delta}

For a ground-up walkthrough of turning the raw PDF text stream from Step 2 into typed ChangeOrderRecord instances — coordinate mapping, line-item isolation, and rasterized-page fallback — see parsing unstructured PDF change orders with Python and pypdf.

Schema and configuration reference

The reconciled record’s field contract:

Field	Type	Constraint	Notes
`change_order_id`	`str`	`^CO-\d{4}-\d{3}$`	e.g. `CO-2024-042`
`cost_code`	`str`	`^\d{2} \d{2} \d{2}$`	MasterFormat division
`wbs_code`	`str`	`^[A-Z]{2,5}-\d{3}-\d{2}-\d{2}$`	WBS element
`original_contract_value`	`Decimal`	`>= 0`	Never `float`
`proposed_adjustment`	`Decimal`	`abs <= 500000`	Above → secondary review
`impact_days`	`int`	none	Negative = recovered schedule float (days)
`approval_status`	`Literal`	5 enum values	`PENDING`…`EXECUTED`
`confidence`	`float`	`0.0`–`1.0`	Drives routing

Routing and reconciliation configuration keys, used identically across the pipeline:

Key	Value	Meaning
`routing.auto_route_threshold`	`0.92`	At or above: commit to ledger
`routing.human_review_threshold`	`0.75`	In `[0.75, 0.92)`: hold for review
`routing.quarantine_below`	`0.75`	Below: dead-letter queue
`match.fuzzy_min_ratio`	`0.92`	Min similarity to apply a fuzzy code match
`reconcile.contingency_cap`	`500000`	Absolute adjustment cap before secondary review
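
These keys could be carried in code as a small immutable object that checks its own invariants at load time. This is a sketch under assumptions: `SyncConfig` is a hypothetical name, and `routing.quarantine_below` is modeled by the same field as the review threshold because the two values are identical in the table.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class SyncConfig:
    """Hypothetical container mirroring the configuration keys in the table."""
    auto_route_threshold: float = 0.92
    human_review_threshold: float = 0.75  # also the quarantine boundary
    fuzzy_min_ratio: float = 0.92
    contingency_cap: Decimal = Decimal("500000")

    def __post_init__(self) -> None:
        # Bands must be ordered, or the review band would be empty or inverted.
        if not 0.0 <= self.human_review_threshold < self.auto_route_threshold <= 1.0:
            raise ValueError("Expected 0 <= human_review < auto_route <= 1.")
        if self.contingency_cap <= 0:
            raise ValueError("Contingency cap must be positive.")


config = SyncConfig()
```

Failing at construction means a mistyped threshold stops the worker at startup rather than misrouting records at runtime.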

Verification and testing

The point of the test suite is to prove the pipeline never fabricates a number and never double-counts. Each branch gets a focused assertion.

from decimal import Decimal

def _record(**kw) -> ChangeOrderRecord:
    base = dict(change_order_id="CO-2024-042", cost_code="03 30 00",
                wbs_code="PROJ-014-03-20", original_contract_value=Decimal("150000"),
                proposed_adjustment=Decimal("12500"), impact_days=5,
                approval_status="APPROVED", confidence=0.99)
    return ChangeOrderRecord(**{**base, **kw})

def test_exact_match_commits_once():
    engine = ReconciliationEngine({"03 30 00": Decimal("150000")})
    rec = _record()
    first = engine.reconcile(rec)
    second = engine.reconcile(rec)  # re-sent document
    assert first["status"] == "COMMITTED"
    assert first["delta"]["reconciled_total"] == Decimal("162500")
    assert second["status"] == "SKIPPED_DUPLICATE"  # idempotent

def test_fuzzy_match_below_threshold_quarantines():
    engine = ReconciliationEngine({"26 05 00": Decimal("90000")})
    out = engine.reconcile(_record(cost_code="03 30 00"))
    assert out["status"] == "QUARANTINE"
    assert out["error_code"] == "ORPHAN_ADJUSTMENT"

def test_low_field_confidence_routes_to_review():
    engine = ReconciliationEngine({"03 30 00": Decimal("150000")})
    out = engine.reconcile(_record(confidence=0.80))
    assert out["status"] == "HUMAN_REVIEW"

def test_contingency_cap_rejected_at_schema():
    import pytest
    from pydantic import ValidationError
    with pytest.raises(ValidationError):
        _record(proposed_adjustment=Decimal("750000"))

Run with python -m pytest tests/test_pdf_excel_sync.py -v. A green run proves an exact match commits, a duplicate is skipped, a weak code match quarantines instead of guessing, low confidence routes to review, and an over-cap adjustment is stopped at the schema boundary.

Troubleshooting

Every change order quarantines as an orphan adjustment. The PDF writes cost codes as 03-30-00 while the workbook stores 03 30 00, so exact lookup always misses and the fuzzy score never clears 0.92. Root cause: separator mismatch and an un-normalized budget map. Fix: normalize both sides through normalize_code before comparing, and standardize codes at ingestion per the budget-code-standardization rules so the two sources share one representation.
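
A minimal sketch of the fix, assuming the budget map is re-keyed once at load time: `canonical_code` and `normalize_budget_keys` are hypothetical helpers, and raising on a collision is a design choice of this sketch rather than documented behaviour.

```python
from decimal import Decimal


def canonical_code(code: str) -> str:
    """Hypothetical helper: render any separator style as 'XX XX XX'."""
    return " ".join(code.replace("-", " ").replace(".", " ").split())


def normalize_budget_keys(budget: dict[str, Decimal]) -> dict[str, Decimal]:
    """Re-key a budget map so every cost code uses the canonical form."""
    normalized: dict[str, Decimal] = {}
    for code, baseline in budget.items():
        key = canonical_code(code)
        if key in normalized:
            # Two rows collapsing to one code is a data problem; do not pick one silently.
            raise ValueError(f"Duplicate cost code after normalization: {key}")
        normalized[key] = baseline
    return normalized


budget = normalize_budget_keys({"03-30-00": Decimal("150000"), "26.05.00": Decimal("90000")})
hit = canonical_code("03 30 00") in budget
```

Once both sides share one representation, the exact lookup succeeds and the fuzzy path is left for genuinely ambiguous codes.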

Reconciled totals are a few cents off after a large batch. The original code read Excel values straight into float and summed them, accumulating binary rounding error. Root cause: float for money. Fix: cast every value through Decimal(str(value)) on read and keep Decimal end to end — the schema already enforces this, so the leak is almost always in a custom reader that bypasses the model.

The same change order is committed twice during a monthly close. A queue retry or a re-sent email replays the record and the ledger double-counts the adjustment. Root cause: commits are not idempotent. Fix: key the commit on idempotency_key and persist the seen-key set in the database (not in process memory, which a worker restart clears), so a replay is a guaranteed no-op.
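
As a rough illustration of a persisted seen-key set, the sketch below uses the standard-library `sqlite3` module with a primary-key constraint. The table name `committed_keys` and the helpers `open_key_store` and `claim_key` are assumptions for the example; a production system would use its own ledger database and would claim the key in the same transaction as the ledger write.

```python
import sqlite3


def open_key_store(path: str = ":memory:") -> sqlite3.Connection:
    """Hypothetical key store; pass a file path to survive worker restarts."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS committed_keys (key TEXT PRIMARY KEY)")
    return conn


def claim_key(conn: sqlite3.Connection, key: str) -> bool:
    """Return True if this call claimed the key, False if it was already committed."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO committed_keys (key) VALUES (?)", (key,)
        )
    # rowcount is 1 when the row was inserted and 0 when the insert was ignored.
    return cur.rowcount == 1


conn = open_key_store()
first = claim_key(conn, "CO-2024-042:03 30 00:12500")
replay = claim_key(conn, "CO-2024-042:03 30 00:12500")
```

Letting the database enforce uniqueness avoids a check-then-write race between two workers that receive the same record at once.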

The budget map silently shifts onto the wrong column. An estimator inserted a “notes” column and a reader using row[3] now pulls the wrong field, so every baseline is wrong but nothing errors. Root cause: positional column access. Fix: resolve columns by header name as map_budget_by_header does, and fail loudly with a clear message when a required header is absent.
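
The effect of header-based resolution can be shown without a workbook. In this sketch `resolve_columns` is a hypothetical stand-in for the lookup inside map_budget_by_header, and the header rows are invented sample data.

```python
def resolve_columns(header_row: tuple, required: tuple[str, ...]) -> dict[str, int]:
    """Map each required header name to its current column index."""
    positions = {
        str(cell).strip().lower(): idx
        for idx, cell in enumerate(header_row)
        if cell is not None
    }
    missing = [name for name in required if name not in positions]
    if missing:
        # Fail loudly instead of reading whatever sits at a stale index.
        raise ValueError(f"Missing required column(s): {missing}")
    return {name: positions[name] for name in required}


required = ("cost code", "baseline value")
before = resolve_columns(("Cost Code", "Description", "Baseline Value"), required)
# An estimator inserts a "Notes" column ahead of the baseline.
after = resolve_columns(("Cost Code", "Description", "Notes", "Baseline Value"), required)
```

The baseline index moves from 2 to 3 after the insertion, and the lookup follows it; a hard-coded index would have started reading the notes column.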

A scanned change order reconciles to zero. pypdf returns an empty string for an image-only PDF, which a naive parser treats as a zero adjustment. Root cause: no text-layer detection. Fix: detect the empty extraction and route to OCR preprocessing instead of passing an empty string downstream — a held document is recoverable, a fabricated zero in the ledger is not.
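
The same rule applies one level down, at the point where a text fragment becomes a number. The following is a sketch only: `NeedsOCR` and `parse_adjustment` are hypothetical names, and the currency cleanup assumes US-style formatting.

```python
from decimal import Decimal, InvalidOperation


class NeedsOCR(Exception):
    """Hypothetical signal that a document must be held for OCR."""


def parse_adjustment(raw: str) -> Decimal:
    """Parse an extracted amount; never turn missing text into 0."""
    cleaned = raw.replace("$", "").replace(",", "").strip()
    if not cleaned:
        raise NeedsOCR("No text extracted; hold the document for OCR.")
    try:
        return Decimal(cleaned)
    except InvalidOperation as exc:
        raise ValueError(f"Unparseable amount: {raw!r}") from exc


amount = parse_adjustment("$12,500.00")
try:
    parse_adjustment("   ")
    held = False
except NeedsOCR:
    held = True
```

Distinguishing "no text" from "bad text" lets the caller send the first to OCR and the second to quarantine.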

Frequently Asked Questions

Why use Decimal instead of float for cost values?

Binary floating point cannot represent most decimal currency values exactly, so summing many float adjustments accumulates rounding error. On a large contract that drift becomes a real, unexplainable budget variance. Every financial field uses Decimal, and values are cast through Decimal(str(value)) on read so they never inherit float error from the source library.
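
A short demonstration of both points, using only the standard library. The ten adjustments of 0.10 are invented sample values.

```python
from decimal import Decimal

float_total = sum([0.1] * 10)  # ten adjustments of 0.10 held as float
decimal_total = sum([Decimal("0.10")] * 10, Decimal("0"))

print(float_total == 1.0)                # False: binary rounding error accumulated
print(decimal_total == Decimal("1.00"))  # True

# Casting through str() avoids inheriting the float's binary expansion.
from_float = Decimal(0.1)
from_str = Decimal(str(0.1))
print(from_float == Decimal("0.1"))      # False
print(from_str == Decimal("0.1"))        # True
```

The second pair is why the reader casts through `Decimal(str(value))`: constructing a `Decimal` directly from a float preserves the float's inexact binary value.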

How does the pipeline avoid double-counting a re-sent change order?

Each record exposes a stable idempotency_key derived from the change-order id, cost code, and adjustment. The reconciliation engine checks a persisted set of committed keys before writing, so a replay — from a queue retry or a re-sent email — is a no-op. The key set must live in the database, not process memory, or a worker restart will let a duplicate through.

What happens when a PDF cost code does not match any budget row?

The matcher tries an exact lookup, then a normalized fuzzy match against MasterFormat divisions. A fuzzy match only applies when its similarity clears the 0.92 band; anything weaker, or no candidate at all, is quarantined as an ORPHAN_ADJUSTMENT for a document-control specialist. The pipeline never applies a low-confidence match, because moving money to the wrong account silently is worse than holding the document.

Why resolve Excel columns by header instead of index?

Estimators insert and delete columns throughout a project. A reader using a fixed index like row[3] will silently read the wrong column the moment someone adds a field, corrupting every baseline without raising an error. Resolving by header name keeps the map correct and fails loudly when a required column is genuinely missing.

How do confidence scores decide where a record goes?

Routing uses the lower of the field-extraction confidence and the cost-code match score. Both must reach 0.92 to auto-commit to the ledger. A combined score in [0.75, 0.92) holds for human review, and anything below 0.75 is quarantined to the dead-letter queue. These three bands are used identically across every subsystem in the pipeline.
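
Reduced to a pure function, the banding looks roughly like the sketch below. `route` is a hypothetical helper for illustration; the status strings follow the ones the engine emits.

```python
AUTO_ROUTE = 0.92
HUMAN_REVIEW = 0.75


def route(field_confidence: float, match_score: float) -> str:
    """Route on the weaker of the two signals."""
    score = min(field_confidence, match_score)
    if score >= AUTO_ROUTE:
        return "COMMITTED"
    if score >= HUMAN_REVIEW:
        return "HUMAN_REVIEW"
    return "QUARANTINE"


boundary = route(0.92, 0.92)     # exactly on the auto-route threshold
weak_field = route(0.80, 1.0)    # strong match, weaker extraction
weak_match = route(0.99, 0.60)   # strong extraction, weak match
```

Both thresholds are inclusive at the lower edge, so a score of exactly 0.92 commits and a score of exactly 0.75 goes to review.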
