Field Extraction Techniques

In construction project tracking, change orders are the highest-risk documentation stream you will automate. They arrive as scanned AIA G701/G702 forms, subcontractor PDFs, Excel takeoff sheets, and fragmented email threads — and every one of them must resolve to a single, type-safe record before it can move a budget total or a schedule date. The specific problem this page solves is the extraction boundary: turning that messy, multi-format input into validated co_id, cost, markup, and schedule-impact fields with a confidence score attached, so a record’s downstream fate is deterministic rather than guessed. Generic optical character recognition alone cannot do this; it produces text, not contract-aware data. What follows is the domain-aware schema design, deterministic parsing logic, and confidence-driven routing that make field extraction the load-bearing layer of an Automated Document Ingestion & Parsing pipeline. Field extraction sits downstream of OCR preprocessing for construction docs and upstream of change order schema validation, so the patterns here are written to hand clean, scored payloads to both.

Prerequisites

Before building the extraction layer, confirm the following packages, infrastructure, and upstream assumptions are in place. The deterministic core depends on the standard library plus Pydantic; the document readers and OCR stack are shared with the rest of the ingestion pipeline.

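Python 3.11+; the code relies on typing.Literal, decimal, and built-in generic annotations such as dict[str, str].
pydantic>=2.5: all schema enforcement uses Pydantic v2 (Field constraints, Literal types, and model_dump_json; field_validator with @classmethod where a custom check is needed).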
pdfplumber>=0.11 or PyMuPDF for coordinate-aware table and text extraction from native PDFs.
pytesseract>=0.3 + Tesseract 5 for scanned forms; the engineering-font and stamp-overlay tuning belongs in OCR preprocessing for construction docs, which this layer assumes has already run.
openpyxl>=3.1 for .xlsx takeoff sheets with merged cells and hidden calculation tabs.
Upstream assumption: documents have already been classified by type and digitized to text. Field extraction does not detect document type; it maps a known change-order document to the canonical schema. Multi-format normalization is orchestrated by PDF/Excel Sync Pipelines before records reach this stage.

Architecture Detail

The extraction subsystem is a funnel: heterogeneous inputs in, one validated and scored record out, with explicit branches for every way a document can fail to resolve. Layout-aware parsing handles structured forms, semantic extraction handles narrative scope text, and a deterministic rule layer locks down the fields that must never be probabilistic — identifiers, cost codes, and currency.

Stage	Input	Processing	Output	Error branch
Read	Scanned form, PDF, XLSX, email	OCR / layout parse / cell map	Raw text + bounding boxes	Unreadable file → dead-letter
Locate	Raw text + boxes	Regex anchors, header mapping	Field candidates (strings)	Anchor not found → low confidence
Coerce	Field candidates	Currency, percent, date normalization	Typed values + confidence	Coercion failure → quarantine
Validate	Typed values	Pydantic schema gate	Validated `ChangeOrder`	`ValidationError` → dead-letter
Route	Validated record	Confidence threshold check	Auto / review / quarantine	n/a

The confidence score is the contract between this subsystem and everything downstream. A record scoring 0.92 or higher auto-routes to financial modeling and approvers; a score from 0.75 up to (but not including) 0.92 diverts to a human-in-the-loop verification queue; anything below 0.75 is quarantined with structured error metadata. These thresholds are canonical across the ingestion pipeline, so a record’s fate is predictable no matter which subsystem scored it.

Step-by-Step Implementation

The following steps build a production-grade extraction boundary. Each step is runnable in isolation; together they form the funnel described above.

1. Define the canonical change order schema

The schema must anticipate the structural variability of construction documentation while enforcing strict type boundaries. Construction domain constants — MasterFormat division codes, WBS element patterns, and discipline codes — are modeled as Literal types or regex-validated fields so free-text drift cannot enter the ledger.

import re
import logging
from decimal import Decimal, InvalidOperation
from typing import Optional, Literal
from pydantic import BaseModel, Field, field_validator, ValidationError

logging.basicConfig(level=logging.INFO, format="%(levelname)s | %(message)s")
logger = logging.getLogger("field_extraction")

OriginatingDocType = Literal["RFI", "Submittal", "Site Directive", "Owner Directive"]
StatusType = Literal["Draft", "Pending Review", "Approved", "Rejected", "Executed"]
Discipline = Literal["ARCH", "STR", "MEP", "CIV", "ELEC", "PLMB"]

# MasterFormat division code: "XX XX XX" (e.g. 03 30 00 = cast-in-place concrete)
COST_CODE_RE = r"^\d{2} \d{2} \d{2}$"
# WBS element: PROJ-NNN-DIV-NN ties the change to a work-breakdown node
WBS_RE = r"^PROJ-\d{3}-[A-Z]{3,4}-\d{2}$"
# Change order id: 2-4 letter prefix, year, sequence
CO_ID_RE = r"^[A-Z]{2,4}-\d{4}-\d{3,5}$"


class ChangeOrder(BaseModel):
    model_config = {"strict": False, "extra": "forbid"}

    co_id: str = Field(pattern=CO_ID_RE)
    originating_doc_type: OriginatingDocType
    scope_description: str = Field(min_length=50)
    cost_code: str = Field(pattern=COST_CODE_RE)
    wbs_element: str = Field(pattern=WBS_RE)
    discipline: Discipline
    direct_cost: Decimal = Field(ge=0, decimal_places=2)
    indirect_cost: Decimal = Field(default=Decimal("0.00"), ge=0, decimal_places=2)
    markup_pct: float = Field(ge=0.0, le=0.25)
    schedule_impact_days: Optional[int] = Field(default=None, ge=0)
    responsible_party: str = Field(min_length=2)
    status: StatusType
    extraction_confidence: float = Field(ge=0.0, le=1.0)

    @property
    def total_cost(self) -> Decimal:
        return (self.direct_cost + self.indirect_cost) * (Decimal(1) + Decimal(str(self.markup_pct)))

Modeling cost_code against the MasterFormat XX XX XX pattern and wbs_element against PROJ-NNN-DIV-NN means a misread division (a common OCR failure on stamped forms) fails at the boundary instead of silently landing in the wrong budget bucket. The total_cost property keeps markup arithmetic in Decimal so cumulative budget rollups never inherit floating-point drift.
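
As a small illustration of why the arithmetic stays in Decimal, the sketch below compares a float sum with a Decimal sum and applies markup the same way total_cost does. The line-item values are made up for the example, and the final quantize to two places is an assumption about how a rollup would be presented, not something the schema itself does.

```python
from decimal import Decimal

# Hypothetical line items: ten charges of 0.10 each.
float_total = sum([0.10] * 10)
decimal_total = sum([Decimal("0.10")] * 10, Decimal("0"))

# Markup applied as in total_cost: the float is converted through str() so the
# Decimal receives "0.12" rather than the binary approximation of 0.12.
markup_pct = 0.12
marked_up = Decimal("15700.50") * (Decimal(1) + Decimal(str(markup_pct)))
rounded = marked_up.quantize(Decimal("0.01"))

print(float_total == 1.0)                 # False: binary floats accumulate error
print(decimal_total == Decimal("1.00"))   # True
print(rounded)                            # 17584.56
```

Converting through str() matters: Decimal(0.12) would carry the float's binary error into the ledger, while Decimal(str(0.12)) starts from the intended two-digit value.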

2. Locate fields with deterministic anchors

Construction documents rarely conform to a single template, but the fields that must be exact are also the most regular. Change order IDs follow conventions like CO-2024-089 (prefix, year, sequence); a project-scoped variant such as PRJ-031-CO-12 does not match CO_ID_RE as written and needs its own pattern. Cost codes follow MasterFormat. Extract these with compiled regular expressions rather than asking a probabilistic model to guess on every transaction. For tabular line items, coordinate-based bounding-box extraction combined with column-header mapping reliably captures the grid; anchor regexes then pull the deterministic fields out of the surrounding text.

def locate_fields(raw_text: str) -> dict[str, str]:
    """Pull deterministic anchors out of OCR/parsed text. Misses are left absent
    so the coercion step can lower confidence rather than fabricate a value."""
    patterns = {
        "co_id": CO_ID_RE,
        "cost_code": COST_CODE_RE,
        "wbs_element": WBS_RE,
    }
    found: dict[str, str] = {}
    for name, pat in patterns.items():
        # Drop the ^...$ anchors so the pattern can match inside stamp/signature
        # noise, then guard both ends so it cannot match a slice of a longer
        # alphanumeric run (e.g. three pairs of digits inside a phone number).
        core = pat.removeprefix("^").removesuffix("$")
        m = re.search(rf"(?<![A-Za-z0-9]){core}(?![A-Za-z0-9])", raw_text)
        if m:
            found[name] = m.group(0)
    return found
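
The coordinate-based table capture mentioned above can be sketched without any PDF library, given word boxes as plain dictionaries. The version below assumes each box has text, x0, x1, and top keys (the shape pdfplumber's extract_words() returns, as far as this sketch relies on it), single-word column headers, and a hypothetical map_table helper; the sample boxes are invented for illustration.

```python
def map_table(words, header_names, row_tolerance=3.0):
    """Group word boxes into rows and assign each word to the nearest header column."""
    by_top = sorted(words, key=lambda w: w["top"])
    centres, header_bottom = {}, None
    for w in by_top:
        # The first (top-most) occurrence of each header name defines its column centre.
        if w["text"] in header_names and w["text"] not in centres:
            centres[w["text"]] = (w["x0"] + w["x1"]) / 2
            header_bottom = w["top"]
    if len(centres) < len(header_names):
        return []  # a header is missing: let the caller lower confidence

    grouped, row_top = [], None
    for w in by_top:
        if w["top"] <= header_bottom + row_tolerance:
            continue  # skip the header row and anything above it
        if row_top is None or w["top"] - row_top > row_tolerance:
            grouped.append([])  # vertical gap: start a new row
            row_top = w["top"]
        grouped[-1].append(w)

    rows = []
    for group in grouped:
        row = {}
        for w in sorted(group, key=lambda w: w["x0"]):  # left-to-right reading order
            centre = (w["x0"] + w["x1"]) / 2
            column = min(centres, key=lambda name: abs(centres[name] - centre))
            row[column] = f"{row.get(column, '')} {w['text']}".strip()
        rows.append(row)
    return rows


# Invented word boxes for a three-column line-item grid.
words = [
    {"text": "Item", "x0": 20, "x1": 50, "top": 50.0},
    {"text": "Description", "x0": 100, "x1": 170, "top": 50.0},
    {"text": "Amount", "x0": 300, "x1": 350, "top": 50.0},
    {"text": "1", "x0": 20, "x1": 28, "top": 70.0},
    {"text": "Rebar", "x0": 100, "x1": 140, "top": 70.0},
    {"text": "supply", "x0": 145, "x1": 190, "top": 70.4},
    {"text": "1,200.00", "x0": 295, "x1": 350, "top": 70.0},
    {"text": "2", "x0": 20, "x1": 28, "top": 90.5},
    {"text": "Formwork", "x0": 100, "x1": 165, "top": 90.5},
    {"text": "850.00", "x0": 305, "x1": 350, "top": 90.5},
]
rows = map_table(words, {"Item", "Description", "Amount"})
print(rows)
```

Nearest-centre assignment is the simplest rule that tolerates right-aligned amounts; a grid with wide or multi-word headers would likely need column boundaries instead of centres.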

3. Coerce values and score the extraction

Field-level constraints must account for regional formatting: comma-separated decimals in European subcontractor submissions, mixed currency symbols on joint-venture projects, and 15%-style markup strings. Coercion normalizes these into typed values. In the code below, the score starts from the OCR confidence and drops by a fixed penalty for every deterministic anchor that could not be found; the same idea extends to any coercion that has to fall back or guess. That score, not a boolean, is what drives routing.

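def coerce_currency(raw: object) -> Decimal:
    """Normalize '€14.250,00', '$14,250.00', '14500.50' -> Decimal."""
    s = str(raw).strip().lstrip("$€£").replace(" ", "")
    last_comma, last_dot = s.rfind(","), s.rfind(".")
    # A lone comma followed by exactly three digits ('14,250') is read as a
    # thousands separator (assumption: currency never carries three decimals).
    lone_thousands = last_dot == -1 and len(s) - last_comma == 4
    if last_comma > last_dot and not lone_thousands:
        # European format: thousands '.', decimal ',' -> swap to canonical
        s = s.replace(".", "").replace(",", ".")
    else:
        s = s.replace(",", "")
    return Decimal(s)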


def coerce_markup(raw: object) -> float:
    s = str(raw).strip()
    if s.endswith("%"):
        return float(s.rstrip("%")) / 100.0
    return float(s)


def build_candidate(raw_data: dict, anchors: dict[str, str]) -> dict:
    """Merge anchor hits with parsed cells; track how many fields were guessed
    so extraction_confidence reflects real uncertainty."""
    misses = 0
    confidence = float(raw_data.get("ocr_confidence", 0.99))

    def take(key: str) -> object:
        nonlocal misses
        val = anchors.get(key) or raw_data.get(key)
        if val is None:
            misses += 1
        return val

    candidate = {
        "co_id": take("co_id"),
        "originating_doc_type": raw_data.get("originating_doc_type"),
        "scope_description": raw_data.get("scope_description", ""),
        "cost_code": take("cost_code"),
        "wbs_element": take("wbs_element"),
        "discipline": raw_data.get("discipline"),
        "direct_cost": coerce_currency(raw_data.get("direct_cost", "0")),
        "indirect_cost": coerce_currency(raw_data.get("indirect_cost", "0")),
        "markup_pct": coerce_markup(raw_data.get("markup_pct", "0")),
        "schedule_impact_days": raw_data.get("schedule_impact_days"),
        "responsible_party": raw_data.get("responsible_party", ""),
        "status": raw_data.get("status"),
    }
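    # each missing deterministic anchor costs 0.08 confidence; clamp at 0.0 so a
    # heavily penalized record is quarantined instead of failing the ge=0.0 check
    candidate["extraction_confidence"] = round(max(0.0, confidence - 0.08 * misses), 4)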
    return candidate
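
The idea that a fallback should lower confidence rather than invent a value can be sketched as a small wrapper around any coercion function. The coerce_or_none helper and the FALLBACK_PENALTY value below are hypothetical, chosen only for illustration; they are not part of the pipeline described here.

```python
from decimal import Decimal, InvalidOperation
from typing import Any, Callable

FALLBACK_PENALTY = 0.05  # hypothetical value; tune against reviewed records


def coerce_or_none(coerce: Callable[[Any], Any], raw: Any) -> tuple[Any, float]:
    """Return (value, penalty). A missing or unparseable input yields None plus a
    penalty, so the caller lowers confidence instead of fabricating a value."""
    if raw is None:
        return None, FALLBACK_PENALTY
    try:
        return coerce(raw), 0.0
    except (InvalidOperation, ValueError, TypeError):
        return None, FALLBACK_PENALTY


ocr_confidence = 0.98
cost, cost_penalty = coerce_or_none(Decimal, "14500.50")
days, days_penalty = coerce_or_none(int, "three")  # not a number: None plus penalty
score = round(max(0.0, ocr_confidence - cost_penalty - days_penalty), 4)
print(cost, days, score)
```

Returning None keeps the decision with the schema: an optional field such as schedule_impact_days stays null, while a required field that comes back None fails validation rather than carrying a made-up number.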

4. Validate at the boundary and route by confidence

Validation executes synchronously at the extraction boundary so malformed payloads fail fast instead of corrupting downstream financial models. The router then applies the canonical thresholds. Cross-field checks — for example, flagging markup_pct that exceeds the prime contract’s overhead-and-profit cap — run here, before the record is allowed to influence the cost ledger.

AUTO_ROUTE_THRESHOLD = 0.92
QUARANTINE_THRESHOLD = 0.75

RoutingState = Literal["auto_route", "human_review", "quarantine", "dead_letter"]


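def extract_and_route(raw_data: dict) -> tuple[RoutingState, ChangeOrder | dict]:
    """Returns (state, payload). The payload is a validated ChangeOrder, or a
    structured error dict for the audit log when coercion or validation fails."""
    anchors = locate_fields(raw_data.get("raw_text", ""))
    co_id = anchors.get("co_id") or raw_data.get("co_id")
    try:
        candidate = build_candidate(raw_data, anchors)
    except (InvalidOperation, ValueError) as exc:
        # Coercion failure: a currency or markup string could not be parsed.
        # Quarantine with the raw input instead of letting the exception escape.
        logger.error("Coercion failed for %s: %r", co_id, exc)
        return "quarantine", {
            "co_id": co_id,
            "errors": [{"type": "coercion_error", "msg": repr(exc)}],
            "raw": raw_data,
        }
    try:
        record = ChangeOrder(**candidate)
    except ValidationError as exc:
        logger.error("Validation failed for %s", candidate.get("co_id"))
        return "dead_letter", {
            "co_id": candidate.get("co_id"),
            "errors": exc.errors(include_url=False),
            "raw": candidate,
        }

    score = record.extraction_confidence
    if score >= AUTO_ROUTE_THRESHOLD:
        state: RoutingState = "auto_route"
    elif score >= QUARANTINE_THRESHOLD:
        state = "human_review"
    else:
        state = "quarantine"

    logger.info("CO %s scored %.3f -> %s (total %s)", record.co_id, score, state, record.total_cost)
    return state, record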


if __name__ == "__main__":
    sample = {
        "raw_text": "Change Order PRJ-2024-089 ... 03 30 00 ... PROJ-031-STR-04",
        "co_id": "PRJ-2024-089",
        "originating_doc_type": "Site Directive",
        "scope_description": (
            "Foundation underpinning required due to unexpected soil "
            "liquefaction identified during geotechnical review."
        ),
        "cost_code": "03 30 00",
        "wbs_element": "PROJ-031-STR-04",
        "discipline": "STR",
        "direct_cost": "$14,500.50",
        "indirect_cost": "1.200,00",  # European format
        "markup_pct": "12%",
        "schedule_impact_days": 3,
        "responsible_party": "Acme Excavation LLC",
        "status": "Pending Review",
        "ocr_confidence": 0.97,
    }
    state, payload = extract_and_route(sample)
    # Only a validated record can be serialized; error payloads are plain dicts.
    if isinstance(payload, ChangeOrder):
        print(payload.model_dump_json(indent=2))

Using Literal types for originating_doc_type, status, and discipline gives the same closed-enum guarantee as a custom validator, without hand-written membership checks that must be kept in sync with the allowed values. model_dump_json produces the canonical serialized record handed to the next stage; note that total_cost is a plain property, so it is not part of that serialized output.
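
The cross-field check described at the start of this step (markup against the prime contract’s overhead-and-profit cap) is not shown in the code above. One way it might look is sketched below, using plain dictionaries so it runs on its own. The cross_field_flags and apply_flags names, the contract cap value, and the rule that a flagged record never auto-routes are all assumptions for illustration.

```python
from decimal import Decimal


def cross_field_flags(record: dict, contract_op_cap: Decimal) -> list[str]:
    """Compare markup_pct with a contract-specific O&P cap, which may be
    tighter than the schema-wide le=0.25 bound."""
    flags: list[str] = []
    markup = Decimal(str(record["markup_pct"]))
    if markup > contract_op_cap:
        flags.append(f"markup_pct {markup} exceeds contract O&P cap {contract_op_cap}")
    return flags


def apply_flags(state: str, flags: list[str]) -> str:
    """Assumed policy: a flagged record is downgraded from auto_route to
    human_review; other states are left unchanged."""
    if flags and state == "auto_route":
        return "human_review"
    return state


# Hypothetical contract with a 15% cap: 18% passes the schema but gets flagged.
flags = cross_field_flags({"markup_pct": 0.18}, Decimal("0.15"))
print(apply_flags("auto_route", flags), flags)
```

Keeping the contract cap outside the schema lets one ChangeOrder model serve projects with different negotiated limits, while the schema bound remains the absolute ceiling.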

Schema and Configuration Reference

Field / key	Type	Rule	Construction rationale
`co_id`	`str`	`^[A-Z]{2,4}-\d{4}-\d{3,5}$`	Stable key for idempotent retries
`originating_doc_type`	`Literal`	RFI / Submittal / Site Directive / Owner Directive	Closed enum prevents free-text drift
`scope_description`	`str`	min length 50	Forces a usable narrative, not a stub
`cost_code`	`str`	`^\d{2} \d{2} \d{2}$`	MasterFormat division (e.g. `03 30 00`)
`wbs_element`	`str`	`PROJ-NNN-DIV-NN`	Ties cost to a work-breakdown node
`discipline`	`Literal`	ARCH/STR/MEP/CIV/ELEC/PLMB	Routes to the right reviewer
`direct_cost`	`Decimal`	`>= 0`, 2 dp	No float drift in budget totals
`indirect_cost`	`Decimal`	`>= 0`, 2 dp	Separated for audit transparency
`markup_pct`	`float`	`0.0`–`0.25`	Caps at contractual O&P limit
`schedule_impact_days`	`int?`	`>= 0`, nullable	Null when no schedule effect
`responsible_party`	`str`	min length 2	Rejects blank or single-character party names
`status`	`Literal`	Draft…Executed	Idempotent contract-admin state
`extraction_confidence`	`float`	`0.0`–`1.0`	Drives the routing decision
`AUTO_ROUTE_THRESHOLD`	config	`0.92`	At or above, auto-route
`QUARANTINE_THRESHOLD`	config	`0.75`	Below this, quarantine

The MasterFormat and WBS patterns here are deliberately the same ones used by WBS mapping strategies and budget code standardization, so a code that validates at extraction also resolves cleanly when the record reaches the data architecture layer.

Verification and Testing

Confirm correct behavior with assertions that exercise each routing branch and the regional-format coercion paths. Treat the confidence math as a tested contract, not an implementation detail.

def test_european_currency_coerces():
    assert coerce_currency("1.200,00") == Decimal("1200.00")
    assert coerce_currency("$14,500.50") == Decimal("14500.50")

def test_markup_percent_string():
    assert coerce_markup("12%") == 0.12

def test_missing_anchor_lowers_confidence():
    raw = {"raw_text": "", "ocr_confidence": 0.99,
           "direct_cost": "100.00", "markup_pct": "0"}
    cand = build_candidate(raw, anchors={})
    # three missing anchors (co_id, cost_code, wbs) => 0.99 - 0.24
    assert cand["extraction_confidence"] == 0.75

def test_invalid_cost_code_dead_letters():
    state, _ = extract_and_route({
        "co_id": "CO-2024-001", "cost_code": "3-30-0",  # wrong format
        "raw_text": "", "originating_doc_type": "RFI",
        "scope_description": "x" * 60, "wbs_element": "PROJ-001-STR-01",
        "discipline": "STR", "direct_cost": "1.00", "markup_pct": "0",
        "responsible_party": "Sub", "status": "Draft", "ocr_confidence": 0.99,
    })
    assert state == "dead_letter"

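Run the module directly (python field_extraction.py) to see a fully populated record serialize via model_dump_json, and run the tests with pytest -q. If the tests live in the module itself, pass the file explicitly (pytest -q field_extraction.py), since pytest only auto-collects files named test_*.py or *_test.py. A clean run confirms that currency normalization, confidence scoring, and the dead-letter branch behave as specified.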

Troubleshooting

OCR confidence drops below 0.75 on stamped drawings. Root cause: approval stamps and signatures overlay the cost block, corrupting the digits. Fix: route these through the deskew-and-stamp-removal step in OCR preprocessing for construction docs before extraction; never lower the threshold to force them through.
Pydantic ValidationError on European decimal formats. Root cause: 14.250,00 reaches the schema as a raw string, which is not a valid decimal literal, so Pydantic cannot parse it. Fix: ensure coerce_currency runs before instantiation; the error means a code path bypassed coercion.
Cost code validates but lands in the wrong budget bucket. Root cause: OCR read 03 30 00 as 08 30 00 (Division 08, Openings), and the pattern still matched. Fix: cross-check the extracted cost_code against the project’s active division list and lower confidence on any code not in scope; pattern validity is not the same as project validity.
markup_pct rejected at the boundary. Root cause: a subcontractor submitted 30% markup, exceeding the le=0.25 contractual cap. This is a correct rejection — route it to error handling protocols for an estimator decision rather than widening the bound.
Records silently double-counted after a resubmission. Root cause: a resubmitted document produces a second record and nothing deduplicates it; extraction is idempotent only when co_id is used as the dedupe key. Fix: fingerprint on co_id plus a content hash before routing, consistent with the async queue architecture that batches these records.
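
Two of the fixes above, the project-scope check on cost_code and the co_id-plus-content-hash fingerprint, can be sketched with the standard library alone. The function names, the 0.10 penalty, and the choice to leave extraction_confidence out of the hash (so a rescan of the same document still counts as a duplicate) are assumptions for illustration.

```python
import hashlib
import json


def scope_penalty(cost_code: str, active_divisions: set[str], penalty: float = 0.10) -> float:
    """Return a confidence penalty when the code's division (first two digits)
    is not on the project's active division list."""
    return 0.0 if cost_code[:2] in active_divisions else penalty


def fingerprint(record: dict) -> str:
    """co_id plus a content hash that is stable across key order."""
    content = {k: v for k, v in record.items() if k != "extraction_confidence"}
    canonical = json.dumps(content, sort_keys=True, default=str)  # default=str covers Decimal
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{record['co_id']}:{digest[:16]}"


active = {"03", "05"}  # hypothetical project: concrete and metals only
print(scope_penalty("03 30 00", active), scope_penalty("08 30 00", active))

first = {"co_id": "CO-2024-001", "direct_cost": "100.00", "extraction_confidence": 0.97}
rescan = {"extraction_confidence": 0.91, "direct_cost": "100.00", "co_id": "CO-2024-001"}
revised = {"co_id": "CO-2024-001", "direct_cost": "150.00", "extraction_confidence": 0.97}
print(fingerprint(first) == fingerprint(rescan))   # same content: duplicate
print(fingerprint(first) == fingerprint(revised))  # changed cost: a new revision
```

A matching fingerprint means the record can be dropped as a resubmission; a matching co_id with a different hash signals a revision that should supersede the earlier record rather than add to it.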

Frequently Asked Questions

Why use deterministic regex instead of an LLM for every field?

Identifiers, cost codes, and currency must be exact, auditable, and cheap to run at high volume. Compiled regexes give precision without per-transaction model cost or nondeterminism. Reserve semantic extraction for genuinely unstructured fields like scope_description, where there is no fixed pattern to anchor on.

How is the extraction confidence score actually computed?

Start from the OCR/parse confidence and subtract a fixed penalty for each deterministic anchor the locator could not find. The result drives routing: 0.92 or higher auto-routes, 0.75 up to (but not including) 0.92 goes to human review, and below 0.75 is quarantined. Tune the per-miss penalty against your own false-route rate rather than treating it as a constant.
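
Tuning the per-miss penalty could look like the sketch below: replay a set of human-reviewed records under several candidate penalties and measure how many auto-routed records turned out to be wrong. The false_auto_route_rate helper and the sample history are hypothetical; real tuning would use your own reviewed records.

```python
AUTO_ROUTE_THRESHOLD = 0.92


def false_auto_route_rate(samples, penalty):
    """samples: (ocr_confidence, missing_anchors, was_correct) tuples.
    Returns the share of auto-routed records that reviewers later marked wrong."""
    auto = [
        was_correct
        for ocr_confidence, misses, was_correct in samples
        if round(ocr_confidence - penalty * misses, 4) >= AUTO_ROUTE_THRESHOLD
    ]
    if not auto:
        return 0.0
    return auto.count(False) / len(auto)


# Invented review history for illustration only.
samples = [
    (0.99, 0, True),
    (0.96, 0, True),
    (0.99, 1, False),
    (0.98, 1, True),
    (0.99, 2, False),
]
rates = {penalty: false_auto_route_rate(samples, penalty) for penalty in (0.02, 0.05, 0.08)}
print(rates)
```

In this invented history the larger penalty removes every wrong auto-route, but it also sends the one correct single-miss record to review, so the penalty trades reviewer workload against ledger risk.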

What happens to a record that fails schema validation?

It never reaches the cost ledger. The router returns a dead_letter state with a structured error payload — offending field, expected pattern, raw value — which feeds the audit log and the manual-review queue handled by error handling protocols. Compliant records in the same batch proceed unaffected.
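
The structured error payload can be flattened into the field / expected pattern / raw value shape with a few lines. The sketch below works on plain dictionaries and assumes they carry the loc, type, msg, input, and ctx keys that Pydantic v2's exc.errors() produces; the summarize_errors name and the sample error are illustrative.

```python
def summarize_errors(errors: list[dict]) -> list[dict]:
    """Reduce validation error dicts to the fields an auditor needs."""
    summary = []
    for err in errors:
        summary.append({
            "field": ".".join(str(part) for part in err.get("loc", ())),
            "problem": err.get("type"),
            # ctx is only present for some error types, e.g. pattern mismatches
            "expected": (err.get("ctx") or {}).get("pattern"),
            "raw_value": err.get("input"),
            "message": err.get("msg"),
        })
    return summary


# Hand-written example shaped like one entry of exc.errors(include_url=False).
example = [{
    "type": "string_pattern_mismatch",
    "loc": ("cost_code",),
    "msg": "String should match pattern '^\\d{2} \\d{2} \\d{2}$'",
    "input": "3-30-0",
    "ctx": {"pattern": "^\\d{2} \\d{2} \\d{2}$"},
}]
audit_rows = summarize_errors(example)
print(audit_rows[0]["field"], audit_rows[0]["raw_value"])
```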

How does this stage stay consistent with the data architecture layer?

The MasterFormat XX XX XX and WBS PROJ-NNN-DIV-NN patterns enforced here are the same ones the taxonomy layer expects, so a code that passes extraction also resolves against WBS mapping and budget code standardization without re-coercion.

For precision handling in financial contexts, see the official Python Decimal Module documentation.
