Parsing unstructured PDF change orders with Python and pypdf

This page covers one precise problem: turning the raw text stream of an unstructured PDF change order into typed, validated fields a construction cost ledger can trust. Change orders arrive as scanned forms, multi-column vendor templates, or dynamically generated documents with inconsistent table boundaries, merged cells, and approval blocks wedged between line items. The job here is the extraction layer that sits in front of PDF/Excel sync pipelines: read the PDF, locate the change-order number, cost code, and per-line amounts, and emit a record with a confidence score attached so downstream reconciliation can route it. Get this layer wrong and the failure is silent: a missed line item or a fabricated zero flows straight into a budget variance nobody can explain months later. pypdf (the maintained successor to PyPDF2; the project returned to the pypdf name after the PyPDF2 3.0 release) remains the lightweight choice for native text extraction. The difficulty is rarely the read; it is the deterministic post-processing that has to survive template drift.

Key rules for change-order extraction

The extractor follows a small set of non-negotiable constraints. Each one exists because violating it corrupts data downstream rather than failing loudly:

Financial values are Decimal, never float. Cast through Decimal(str(value)) so binary rounding never enters the pipeline and corrupts cost aggregations.
Cost codes use the MasterFormat XX XX XX division pattern and WBS elements use the site PROJ-NNN-DIV-NN convention. Both are regex-validated fields, not free strings, so a malformed code is caught at the boundary.
Never return an empty string downstream. An image-only PDF yields "" from pypdf; a naive parser reads that as a zero adjustment. Detect it and route to OCR preprocessing for construction docs.
Every extracted field carries a confidence score in 0.0–1.0. The site-canonical bands are constant across the pipeline: 0.92 and above auto-routes, 0.75–0.92 holds for human review, and below 0.75 is quarantined.
Prefer semantic anchors over coordinates. Positional extraction breaks across printers and page layouts; anchored regex against labels like Change Order No. survives template drift.
Dates normalize to ISO 8601. A 06/27/2026 from one vendor and 27-06-2026 from another must converge on one representation before storage.
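
The rules above can be sketched as small boundary checks. This is a minimal illustration, not the site's canonical implementation: the helper names (`is_valid_cost_code`, `is_valid_wbs`, `parse_money`) are hypothetical, and the WBS pattern assumes the PROJ-NNN-DIV-NN convention means an uppercase prefix followed by a three-digit and two two-digit groups.

```python
import re
from decimal import Decimal, InvalidOperation

# Assumed encodings of the two conventions named in the rules.
COST_CODE_RE = re.compile(r"\d{2} \d{2} \d{2}")        # MasterFormat XX XX XX
WBS_RE = re.compile(r"[A-Z]{2,5}-\d{3}-\d{2}-\d{2}")   # PROJ-NNN-DIV-NN


def is_valid_cost_code(code: str) -> bool:
    """True only when the whole string is a MasterFormat-style code."""
    return COST_CODE_RE.fullmatch(code) is not None


def is_valid_wbs(code: str) -> bool:
    """True only when the whole string follows the assumed WBS pattern."""
    return WBS_RE.fullmatch(code) is not None


def parse_money(token: str) -> Decimal:
    """Parse a currency token into Decimal; an empty token is an error, never zero."""
    cleaned = token.strip().replace("$", "").replace(",", "")
    if not cleaned:
        raise ValueError("empty amount; refusing to coerce to zero")
    try:
        value = Decimal(cleaned)
    except InvalidOperation as exc:
        raise ValueError(f"not a monetary value: {token!r}") from exc
    if not value.is_finite():
        raise ValueError(f"not a finite monetary value: {token!r}")
    return value
```

Raising on an empty token, instead of returning `Decimal("0")`, is what keeps a blank extraction from turning into a fabricated zero adjustment.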

Step 1 — Reliable text extraction with pypdf

pypdf reads at the object level and returns raw strings in content-stream order, which usually matches reading order but discards visual layout. The extractor iterates pages, normalizes whitespace, and refuses to swallow an empty extraction: an image-only page is a routing decision, not a blank to ignore.

import logging
import re
from pathlib import Path

import pypdf

logging.basicConfig(level=logging.INFO, format="%(asctime)s | %(levelname)s | %(message)s")
logger = logging.getLogger("co_pdf_parser")


def extract_raw_text(pdf_path: Path) -> str:
    """Return the concatenated native text layer, or signal that OCR is required."""
    blocks: list[str] = []
    try:
        reader = pypdf.PdfReader(str(pdf_path))
    except pypdf.errors.PdfReadError as exc:
        logger.error("Cannot read PDF %s: %s", pdf_path, exc)
        raise RuntimeError("PDF read error") from exc

    for page_num, page in enumerate(reader.pages, start=1):
        page_text = page.extract_text() or ""
        if page_text.strip():
            # Keep line breaks for the state machine; collapse only intra-line runs.
            blocks.append(re.sub(r"[ \t]+", " ", page_text).strip())
        else:
            logger.warning("Page %d returned no text layer.", page_num)

    text = "\n".join(blocks).strip()
    if not text:
        # An empty string parses to a fabricated zero downstream. Route to OCR instead.
        raise RuntimeError("No extractable text layer; route to OCR preprocessing.")
    return text

Note the deliberate choice to keep newlines: the table parser in Step 3 is a line-oriented state machine, so flattening every page into a single space-joined string — a common shortcut — destroys the row boundaries it depends on.
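
To make the difference concrete, here is a small sketch comparing the two normalizations on an invented page of text. The function names `normalize_keep_rows` and `flatten` are illustrative only and are not part of the extractor.

```python
import re


def normalize_keep_rows(page_text: str) -> str:
    """Collapse runs of spaces and tabs inside each line, but keep the line breaks."""
    lines = (re.sub(r"[ \t]+", " ", ln).strip() for ln in page_text.splitlines())
    return "\n".join(ln for ln in lines if ln)


def flatten(page_text: str) -> str:
    """The shortcut to avoid: every whitespace run, newlines included, becomes one space."""
    return re.sub(r"\s+", " ", page_text).strip()


# Invented sample: a header row, two line items, and a stray blank line.
sample = "Description   Qty  Amount\nExcavation\t10   500.00\n\nBackfill  4  120.00\n"

kept = normalize_keep_rows(sample)
flat = flatten(sample)

print(len(kept.split("\n")))  # 3 rows survive for the state machine
print(len(flat.split("\n")))  # 1 row: header and items are fused together
```

With the flattened string, the header and both items sit on the same line, so a line-oriented parser has nothing left to iterate over once it has consumed the header.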

Step 2 — Metadata via semantic anchors

Unstructured change orders rarely share a template, so the parser locates key-value pairs by the label that precedes them rather than by position. The patterns tolerate construction-specific formatting: $12,345.67, Lump Sum, N/A, and vendor abbreviations. Each match contributes to a field-level confidence score, which is exactly the input that field extraction techniques formalize for the whole pipeline.

from dataclasses import dataclass, field

ANCHORS: dict[str, re.Pattern[str]] = {
    "co_number": re.compile(r"Change\s*Order\s*(?:No\.?|#|Number)\s*:?\s*([A-Z0-9\-]+)", re.I),
    "project_id": re.compile(r"Project\s*(?:No\.?|#|ID)\s*:?\s*([A-Z0-9\-]+)", re.I),
    "issue_date": re.compile(r"Date\s*:?\s*(\d{1,2}[/\-.]\d{1,2}[/\-.]\d{2,4})"),
    "contractor": re.compile(r"(?:Contractor|Subcontractor|Vendor)\s*:\s*([^\n,]+)", re.I),
    "total_amount": re.compile(
        r"Total\s*(?:Change\s*Order|Cost|Amount|Value)\s*:?\s*\$?([\d,]+\.?\d*)", re.I
    ),
}


@dataclass
class RawMetadata:
    fields: dict[str, str] = field(default_factory=dict)
    confidence: float = 0.0


def parse_metadata(text: str) -> RawMetadata:
    """Locate CO metadata by semantic anchor; confidence = fraction of fields found."""
    found: dict[str, str] = {}
    for name, pattern in ANCHORS.items():
        match = pattern.search(text)
        if match:
            found[name] = match.group(1).strip()
    # Unweighted: every anchor counts equally, so a missing CO number costs the same 0.2 as a missing date.
    confidence = len(found) / len(ANCHORS)
    return RawMetadata(fields=found, confidence=round(confidence, 2))
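
As a quick check of why anchors tolerate template drift, the sketch below runs two of the patterns from ANCHORS against three invented layouts that place the same values in different positions and label styles. The sample strings are made up for illustration.

```python
import re

# Two anchors copied from ANCHORS above.
CO_NUMBER = re.compile(r"Change\s*Order\s*(?:No\.?|#|Number)\s*:?\s*([A-Z0-9\-]+)", re.I)
TOTAL = re.compile(
    r"Total\s*(?:Change\s*Order|Cost|Amount|Value)\s*:?\s*\$?([\d,]+\.?\d*)", re.I
)

# Invented layouts: same values, different label style, order, and separators.
layouts = [
    "CHANGE ORDER NO. CO-2026-014\nTotal Amount: $12,345.67",
    "Total Change Order $12,345.67\n\nChange Order #: CO-2026-014",
    "change order number CO-2026-014 | total cost 12,345.67",
]

results = []
for text in layouts:
    co = CO_NUMBER.search(text)
    total = TOTAL.search(text)
    results.append((co.group(1) if co else None, total.group(1) if total else None))

for row in results:
    print(row)  # each layout yields ('CO-2026-014', '12,345.67')
```

Note that the captured total is still a string with a thousands separator; converting it to Decimal is a separate step.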

Step 3 — Line-item reconstruction with a state machine

Change-order tables span pages, omit grid lines, and wrap long scope descriptions across rows. A robust parser tracks state across lines, detects the header, and reconstructs rows until a terminal keyword (totals, signatures, approvals) ends the table. Critically, the entire concatenated text stream is passed in once — processing page-by-page resets the state machine mid-table and drops every row after a page break.

from decimal import Decimal, InvalidOperation

HEADER_RE = re.compile(r"\b(?:Desc(?:ription)?|Scope|Item|Qty|Unit|Rate|Amount|Cost)\b", re.I)
TERMINAL_RE = re.compile(r"\b(?:Sub\s*total|Grand\s*Total|Total|Authorized|Signature|Approved)\b", re.I)
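# Tokens must start with a digit; otherwise a bare comma in a wrapped description counts as a number.
NUMERIC_RE = re.compile(r"\$?(\d[\d,]*\.\d{2}|\d[\d,]*)")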
UNIT_RE = re.compile(r"\b(SF|LF|CY|HR|TON|EA|LS)\b", re.I)


@dataclass
class RawLineItem:
    description: str
    quantity: Decimal | None
    unit: str
    rate: Decimal | None
    amount: Decimal | None


def _to_decimal(token: str) -> Decimal | None:
    try:
        return Decimal(token.replace(",", ""))
    except InvalidOperation:
        return None


def parse_line_items(text: str) -> list[RawLineItem]:
    """State-machine row parser; merges wrapped description lines into the prior row."""
    items: list[RawLineItem] = []
    in_table = False

    for line in (ln.strip() for ln in text.split("\n") if ln.strip()):
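        # Require two column labels so a lone "Cost" or "Item" in the metadata block cannot open the table.
        if not in_table and len(HEADER_RE.findall(line)) >= 2: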
            in_table = True
            continue
        if in_table and TERMINAL_RE.search(line):
            break  # totals/signature block ends the table
        if not in_table:
            continue

        nums = NUMERIC_RE.findall(line)
        if not nums and items:
            # A line with text but no numbers is a wrapped scope description.
            items[-1].description = f"{items[-1].description} {line}".strip()
            continue
        if not nums:
            continue

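        # Step 1 collapses space runs, so column gaps are gone: cut at the first pipe or numeric token.
        description = re.split(r"\s*\|\s*|\s+(?=\$?\d)", line, maxsplit=1)[0].strip()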
        unit_match = UNIT_RE.search(line)
        items.append(
            RawLineItem(
                description=description,
                quantity=_to_decimal(nums[0]) if len(nums) > 1 else None,
                unit=unit_match.group(1).upper() if unit_match else "EA",
                rate=_to_decimal(nums[1]) if len(nums) > 2 else None,
                amount=_to_decimal(nums[-1]),
            )
        )
    return items
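
One way to get more signal out of the parsed rows is an arithmetic cross-check: when a row carries quantity, rate, and amount, the first two should multiply out to the third. The sketch below is a hypothetical heuristic, not part of the parser above; the function name `line_confidence` and the 1.0 / 0.5 / 0.0 scores are assumptions chosen for illustration.

```python
from decimal import Decimal
from typing import Optional

CENT = Decimal("0.01")


def line_confidence(
    quantity: Optional[Decimal],
    rate: Optional[Decimal],
    amount: Optional[Decimal],
    tolerance: Decimal = CENT,
) -> float:
    """Score one row by whether quantity * rate reproduces the extracted amount."""
    if amount is None:
        return 0.0  # nothing usable was extracted
    if quantity is None or rate is None:
        return 0.5  # amount-only row (e.g. lump sum): nothing to cross-check
    extended = (quantity * rate).quantize(CENT)
    return 1.0 if abs(extended - amount) <= tolerance else 0.0


# Invented rows.
print(line_confidence(Decimal("10"), Decimal("50.00"), Decimal("500.00")))  # 1.0
print(line_confidence(Decimal("10"), Decimal("50.00"), Decimal("505.00")))  # 0.0
print(line_confidence(None, None, Decimal("1200.00")))                      # 0.5
```

A row that fails this check often means the parser picked up a digit from the description as the quantity, which is exactly the kind of error that should lower confidence rather than pass silently.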

Step 4 — Validate into a typed Pydantic record

Raw extraction yields strings and loose dataclasses; the ledger needs a strict contract. This is the change order schema validation step — the canonical model and its full rule set live in schema validation rules, and the record below mirrors the same field constraints the reconciliation engine imports. Financial fields are Decimal, the cost code is regex-pinned to MasterFormat, the WBS element follows the site pattern bound during WBS mapping, and the extraction confidence rides on the record so routing has a number to act on.

from datetime import date, datetime
from typing import Literal

from pydantic import BaseModel, Field, field_validator

ApprovalStatus = Literal["PENDING", "UNDER_REVIEW", "APPROVED", "REJECTED", "EXECUTED"]


class ParsedLineItem(BaseModel):
    description: str = Field(min_length=1)
    quantity: Decimal = Field(default=Decimal("0"), ge=0)
    unit: Literal["EA", "SF", "LF", "CY", "HR", "TON", "LS"] = "EA"
    rate: Decimal = Field(default=Decimal("0"), ge=0)
    amount: Decimal


class ParsedChangeOrder(BaseModel):
    """Validated extraction output, ready for PDF/Excel reconciliation."""

    change_order_id: str = Field(pattern=r"^CO-\d{4}-\d{3}$")
    cost_code: str = Field(pattern=r"^\d{2} \d{2} \d{2}$")        # MasterFormat XX XX XX
    wbs_code: str = Field(pattern=r"^[A-Z]{2,5}-\d{3}-\d{2}-\d{2}$")
    issue_date: date
    contractor: str = Field(min_length=1)
    approval_status: ApprovalStatus = "PENDING"
    line_items: list[ParsedLineItem]
    confidence: float = Field(ge=0.0, le=1.0)

    @field_validator("issue_date", mode="before")
    @classmethod
    def coerce_iso_date(cls, v: object) -> object:
        """Normalize the vendor date soup to a real date before storage."""
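        if isinstance(v, datetime):
            return v.date()  # drop the time part; a date field does not accept a non-zero time
        if isinstance(v, date):
            return v
        # Month-first formats are tried first, so an ambiguous 03/04/2026 resolves to March 4.
        for fmt in ("%m/%d/%Y", "%m-%d-%Y", "%m.%d.%Y", "%d/%m/%Y", "%d-%m-%Y", "%d.%m.%Y", "%Y-%m-%d", "%m/%d/%y"):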
            try:
                return datetime.strptime(str(v), fmt).date()
            except ValueError:
                continue
        raise ValueError(f"Unparseable issue date: {v!r}")

    @property
    def total_extracted_amount(self) -> Decimal:
        return sum((li.amount for li in self.line_items), Decimal("0"))
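
Date normalization has one trap worth isolating: a value like 03/04/2026 parses under both month-first and day-first conventions. The standalone sketch below uses only the standard library and a hypothetical helper, `normalize_date`, that returns the ISO 8601 string together with a flag when the two readings disagree. Preferring the month-first reading is an assumption made here for illustration.

```python
from datetime import date, datetime
from typing import Optional

# Assumed policy: month-first is preferred, day-first is the fallback.
MONTH_FIRST = ("%m/%d/%Y", "%m-%d-%Y")
DAY_FIRST = ("%d/%m/%Y", "%d-%m-%Y")


def _try_formats(raw: str, formats: tuple) -> Optional[date]:
    for fmt in formats:
        try:
            return datetime.strptime(raw, fmt).date()
        except ValueError:
            continue
    return None


def normalize_date(raw: str) -> tuple:
    """Return (ISO 8601 string, ambiguous) for a vendor-formatted date."""
    month_first = _try_formats(raw, MONTH_FIRST)
    day_first = _try_formats(raw, DAY_FIRST)
    if month_first is None and day_first is None:
        raise ValueError(f"Unparseable date: {raw!r}")
    chosen = month_first if month_first is not None else day_first
    ambiguous = (
        month_first is not None and day_first is not None and month_first != day_first
    )
    return chosen.isoformat(), ambiguous


print(normalize_date("06/27/2026"))  # ('2026-06-27', False)
print(normalize_date("27-06-2026"))  # ('2026-06-27', False)
print(normalize_date("03/04/2026"))  # ('2026-03-04', True)
```

An ambiguous flag is a natural input to the confidence score: the date is stored, but the record does not auto-route on a guess.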

The end-to-end driver wires the four stages together behind one error boundary and folds the metadata and line-item confidences into a single score for routing:

from typing import Any


def run_co_parser(pdf_path: Path, *, cost_code: str, wbs_code: str) -> dict[str, Any]:
    """Extract, parse, and validate a change-order PDF into a routable payload."""
    try:
        text = extract_raw_text(pdf_path)
        meta = parse_metadata(text)
        raw_items = parse_line_items(text)
        line_confidence = 1.0 if raw_items else 0.0
        score = round(min(meta.confidence, line_confidence), 2)

        record = ParsedChangeOrder(
            change_order_id=meta.fields.get("co_number", "CO-0000-000"),
            cost_code=cost_code,
            wbs_code=wbs_code,
            issue_date=meta.fields.get("issue_date", "1970-01-01"),
            contractor=meta.fields.get("contractor", "UNKNOWN"),
            line_items=[
                ParsedLineItem(
                    description=i.description,
                    quantity=i.quantity or Decimal("0"),
                    unit=i.unit if i.unit in {"EA", "SF", "LF", "CY", "HR", "TON", "LS"} else "EA",
                    rate=i.rate or Decimal("0"),
                    amount=i.amount,
                )
                for i in raw_items
                if i.amount is not None
            ],
            confidence=score,
        )
        disposition = "AUTO_ROUTE" if score >= 0.92 else "HUMAN_REVIEW" if score >= 0.75 else "QUARANTINE"
        return {"status": "ok", "disposition": disposition, "record": record.model_dump_json()}
    except Exception as exc:  # boundary: never crash the worker on one bad document
        logger.error("Parse failed for %s: %s", pdf_path, exc)
        return {"status": "error", "message": str(exc)}
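
The driver captures a total_amount anchor in Step 2 but does not compare it with the sum of the parsed line items. A possible extension, sketched here under assumptions, caps the routing score whenever the two disagree so the record cannot auto-route. The function name `reconcile_total`, the one-cent tolerance, and the 0.91 cap are all hypothetical choices.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional

REVIEW_CAP = 0.91  # assumed: just under the 0.92 auto-route band


def reconcile_total(
    anchored_total: Optional[str],
    amounts: list,
    score: float,
    tolerance: Decimal = Decimal("0.01"),
) -> float:
    """Return the score, capped below auto-route when stated and summed totals differ."""
    if anchored_total is None:
        return score  # nothing to compare against
    try:
        stated = Decimal(anchored_total.replace("$", "").replace(",", ""))
    except InvalidOperation:
        return min(score, REVIEW_CAP)  # unreadable total is itself suspicious
    extracted = sum(amounts, Decimal("0"))
    if abs(stated - extracted) <= tolerance:
        return score
    return min(score, REVIEW_CAP)


# Invented figures: the second call is missing a line item.
print(reconcile_total("12,345.67", [Decimal("12000.00"), Decimal("345.67")], 1.0))  # 1.0
print(reconcile_total("12,345.67", [Decimal("12000.00")], 1.0))                     # 0.91
```

This catches the silent failure the page warns about: a dropped row leaves every remaining field valid, and only the total reveals the gap.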

Common mistakes and gotchas

Flattening the whole PDF into one line. Replacing every \n with a space (a tempting normalization) erases row boundaries, and the state-machine parser then sees one giant line and emits zero items. Normalize intra-line whitespace only and keep newlines for the table layer.
Treating an empty extraction as zero. Scanned change orders have no text layer, so extract_text() returns "". A parser that maps that to a Decimal("0") adjustment writes a fabricated zero into the ledger — worse than failing, because it looks valid. Detect the empty stream and route to OCR.
Using float for currency or implicit coercion on read. Reading $1,234.56 into a float and summing across line items accumulates binary error that surfaces as cents-off totals after a large batch. Cast through Decimal(str(token)) and keep Decimal end to end; the Pydantic model enforces this only if you do not bypass it with a hand-rolled cast.
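
The float problem in the last item is easy to reproduce with two invented amounts. The sketch also shows why the cast goes through a string: constructing a Decimal directly from a float preserves the binary error instead of removing it.

```python
from decimal import Decimal

tokens = ["0.10", "0.20"]  # invented line-item amounts as extracted text

# Float accumulation: binary representation error leaks into the total.
float_total = 0.0
for t in tokens:
    float_total += float(t)

# Decimal accumulation: the text values are represented exactly.
decimal_total = sum((Decimal(t) for t in tokens), Decimal("0"))

print(float_total == 0.3)                # False
print(decimal_total == Decimal("0.30"))  # True

# Decimal(float) copies the binary approximation; Decimal(str(float)) does not.
print(Decimal(0.1) == Decimal("0.1"))       # False
print(Decimal(str(0.1)) == Decimal("0.1"))  # True
```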

Where this fits in the pipeline

This extractor is the front door of the reconciliation subsystem. Its ParsedChangeOrder output is exactly what the parent PDF/Excel Sync Pipelines consumes to match each line item against an Excel budget row and commit an idempotent delta. During bid periods and monthly closes a job site submits change orders in bursts, so the driver above runs as a task body behind async batching workflows rather than inline. Anything that does not clear the 0.92 band, such as a partial metadata match or a table with no parsed rows, is emitted with a HUMAN_REVIEW or QUARANTINE disposition, while a document with no text layer or a record that fails validation, such as an out-of-range adjustment, comes back as an error payload; both are handed to error handling protocols for dead-letter routing and replay. Cost codes are reconciled against the canonical taxonomy defined by budget code standardization so a PDF reference and an Excel row can actually be matched.

Frequently Asked Questions

Why pypdf instead of a coordinate-based extractor like pdfplumber?

pypdf reads the native text layer cheaply and usually returns it in reading order, which is enough when you parse with semantic anchors and a state machine rather than fixed coordinates. Coordinate extractors break across printers and template revisions; anchored regex against labels like Change Order No. survives that drift. For genuinely image-only PDFs neither helps; those route to OCR preprocessing first.

How does the parser handle a change-order table that spans two pages?

Pass the entire concatenated text stream to parse_line_items once, not page by page. The state machine sets in_table = True at the header and stays in that state across page boundaries until it hits a terminal keyword (totals, signature, or approval block). Processing pages individually resets the state and silently drops every row after the first page break.
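
The effect is easy to see with a stripped-down stand-in for the state machine. The `count_rows` function below is a simplified illustration, not parse_line_items itself: it only counts the lines between a header and a terminal keyword, and the two page strings are invented.

```python
import re

HEADER = re.compile(r"\bDescription\b.*\bAmount\b", re.I)
TERMINAL = re.compile(r"\bTotal\b", re.I)


def count_rows(text: str) -> int:
    """Count lines between the table header and the first terminal keyword."""
    in_table = False
    rows = 0
    for line in (ln.strip() for ln in text.split("\n") if ln.strip()):
        if not in_table:
            in_table = bool(HEADER.search(line))  # the header line itself is not a row
            continue
        if TERMINAL.search(line):
            break
        rows += 1
    return rows


# Invented two-page table: the header appears only on page 1.
page_1 = "Description Qty Amount\nExcavation 10 500.00\nBackfill 4 120.00"
page_2 = "Compaction 2 80.00\nTotal 700.00"

whole = count_rows(page_1 + "\n" + page_2)
per_page = count_rows(page_1) + count_rows(page_2)

print(whole)     # 3
print(per_page)  # 2: page 2 never sees a header, so its row is dropped
```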

What happens to a multi-line scope description?

A wrapped description appears as a line with text but no numeric tokens. The parser detects that case and appends the text to the previous line item’s description field instead of creating a new, amount-less row. This keeps the row count accurate and prevents a spurious orphan line in the reconciled output.

How is the confidence score for routing computed?

The driver takes the lower of the metadata confidence (fraction of anchored fields located) and the line-item confidence (whether any rows parsed). A combined score at or above 0.92 auto-routes, 0.75–0.92 goes to human review, and below 0.75 is quarantined. These bands are identical across every subsystem in the pipeline.
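
A worked version of that computation, as a standalone sketch: the function names `combined_score` and `disposition` are illustrative, while the formula and thresholds follow the driver. With five anchors the metadata confidence moves in steps of 0.2, so a single missing field already lands in the review band.

```python
def combined_score(fields_found: int, fields_expected: int, rows_parsed: int) -> float:
    """Lower of metadata confidence (fraction found) and line confidence (any rows)."""
    metadata_confidence = fields_found / fields_expected
    line_confidence = 1.0 if rows_parsed else 0.0
    return round(min(metadata_confidence, line_confidence), 2)


def disposition(score: float) -> str:
    """Map a combined score onto the three routing bands."""
    if score >= 0.92:
        return "AUTO_ROUTE"
    if score >= 0.75:
        return "HUMAN_REVIEW"
    return "QUARANTINE"


# (anchors found, anchors expected, rows parsed)
for case in [(5, 5, 3), (4, 5, 3), (3, 5, 3), (5, 5, 0)]:
    score = combined_score(*case)
    print(case, score, disposition(score))
# (5, 5, 3) 1.0 AUTO_ROUTE
# (4, 5, 3) 0.8 HUMAN_REVIEW
# (3, 5, 3) 0.6 QUARANTINE
# (5, 5, 0) 0.0 QUARANTINE
```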

Why validate into Pydantic instead of returning the dataclasses?

The dataclasses are extraction scratch space; the ledger needs a strict contract. The ParsedChangeOrder model regex-pins the cost code to MasterFormat and the WBS element to the site pattern, coerces vendor date formats to ISO 8601, and enforces Decimal financials — so a malformed code or unparseable date fails at the boundary rather than corrupting reconciliation downstream.
