Submittal Metadata Frameworks

A submittal is the contractual proof that what gets installed matches what was specified — the shop drawing, the product data sheet, the material sample, the manufacturer certification. The specific sub-problem this page solves is how a pipeline turns a submittal package arriving as a scanned PDF, a vendor spec sheet, or a portal export into a typed, machine-readable object whose cost and revision metadata can trigger change order automation deterministically. When that metadata stays trapped inside unstructured documents, estimators lose visibility into cost deltas, project managers cannot enforce approval deadlines, and a substitution slips through against the wrong specification section. Inside a deterministic construction data architecture and taxonomy, the submittal framework is the contract that makes a submittal routable: it pins down identity, classification, revision lineage, and quantified cost impact so an approved substitution can generate a draft change order without a human re-keying the package. This page details the ingestion-to-routing pipeline for that contract — the schema itself, the idempotent normalization that resolves a submittal to canonical scope, and the confidence-scored matching that decides whether a record auto-routes, waits for review, or is quarantined. It targets Python automation builders, project engineers, and estimators who need predictable submittal data under real-world input variance.

Prerequisites

This subsystem sits downstream of document extraction and upstream of the change order ledger and the approval router. Before implementing the patterns below, you need:

Python 3.11+ with pydantic v2 for typed validation, plus the standard-library decimal, re, difflib, typing, uuid, datetime, and logging modules. No floating-point money touches a cost field; every monetary value is a Decimal.
A canonical scope taxonomy to resolve against. Each submittal must bind to the same Work Breakdown Structure that drives WBS mapping strategies; the submittal carries what was proposed and which spec section governs it, the WBS node carries where that scope sits in the project and budget.
A standardized cost vocabulary so quantified deltas post against real accounts. Cost fields reference the canonical keys defined by budget code standardization rather than ad-hoc cost labels copied off a vendor quote.
A task queue — Celery on a Redis or RabbitMQ broker — so malformed or low-confidence submittals can be parked in a dead-letter queue and replayed instead of dropped. The escalation policy for parked and SLA-breached records is owned by fallback alert routing.
An upstream extraction step that has produced raw field strings and a per-field confidence score. Submittals lifted from scanned stamped drawings carry confidence metadata from the OCR preprocessing stage; the routing logic below depends on it.

The pipeline assumes inbound payloads have already cleared structural schema validation rules at the API gateway, so the work here is submittal-specific normalization, cost-impact validation, and routing — not raw document parsing.

Architecture: lifecycle, inputs, and routing

A submittal is not a static document; it is a record that moves through a finite set of review states, and every transition is a place where automation either advances the record or parks it. The schema has to support two orthogonal concerns at once: a review lifecycle state machine and an ingestion pipeline that normalizes and routes each inbound revision. Keeping these separate is what lets a high-cost substitution escalate without breaking the package’s revision invariants. Unlike a one-shot document, a submittal commonly loops — Revise and Resubmit sends the package back through the same states under a new revision id — so the lifecycle must model that cycle explicitly rather than assuming linear progress. The state machine below governs the legal transitions a single submittal package may take.

The review lifecycle has two terminal states and one cycle: Approved and Approved as Noted archive to For Record, Rejected ends on its own, and Revise and Resubmit loops back to Submitted under a fresh revision id.
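The lifecycle described above can be sketched as a transition table. This is a minimal illustration, not the page's own implementation: the submitted to under_review edge is an assumption (the text names the states but not that edge), and can_transition and next_revision_id are hypothetical helper names.

```python
# Legal transitions for one submittal package. The submitted -> under_review
# edge is assumed; the other edges follow the lifecycle described in the text.
TRANSITIONS: dict[str, frozenset[str]] = {
    "submitted": frozenset({"under_review"}),
    "under_review": frozenset(
        {"approved", "approved_as_noted", "revise_and_resubmit", "rejected"}
    ),
    "approved": frozenset({"for_record"}),
    "approved_as_noted": frozenset({"for_record"}),
    "revise_and_resubmit": frozenset({"submitted"}),  # re-enters under a new revision id
    "rejected": frozenset(),    # terminal
    "for_record": frozenset(),  # terminal
}


def can_transition(current: str, target: str) -> bool:
    """Return True when target is a legal next state for current."""
    return target in TRANSITIONS.get(current, frozenset())


def next_revision_id(revision_id: str) -> str:
    """Bump R00 -> R01 for the resubmit loop (hypothetical helper)."""
    number = int(revision_id[1:]) + 1
    if number > 99:
        raise ValueError("the two-digit revision_id format tops out at R99")
    return f"R{number:02d}"
```

Keeping the table as data means a reviewer can audit the legal moves without reading control flow, and an illegal jump such as rejected to submitted is refused by lookup rather than by a chain of conditionals.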

The ingestion pipeline runs orthogonally to the lifecycle: a raw payload flows through classification normalization, WBS resolution, substitution matching, cost-impact validation, and routing. Each stage has its own failure branch, and the routing decision uses the site-canonical confidence bands: a match score of 0.92 or above auto-routes, a score from 0.75 up to (but not including) 0.92 parses but flags the record for human review, and a score below 0.75 quarantines the record to the dead-letter queue rather than committing it against a guessed scope or specification section.

The ingestion path is gated twice: the match-confidence bands decide auto-route, human-review, or quarantine, and the routing decision drafts a change order only when an approved revision crosses the cost threshold. Every failure — unknown discipline, sub-0.75 confidence, or a bad cost value — feeds the dead-letter queue, which owns escalation through the fallback alert router.

Stage	Input	Output	Error branch
Classification normalize	Raw discipline/trade/spec strings	Enumerated discipline + spaced CSI section	Unknown discipline → quarantine
WBS resolution	Cleaned spec/location string	Canonical WBS node + confidence	`< 0.75` → quarantine; `0.75–0.92` → review
Substitution match	Proposed product vs specified basis	Match verdict + confidence	`< 0.75` → quarantine for engineer
Cost-impact validation	Unit cost / quantity deltas	Typed `Decimal` impact total	Precision / negative value → quarantine
Routing	Validated `SubmittalRevision`	Auto-file / approval queue / change order	SLA breach → fallback alert router
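One way to picture the per-stage error branches in the table is a runner that stops at the first failing stage and records which stage parked the payload. The sketch below is illustrative only: run_pipeline, Quarantined, and the two placeholder stages are hypothetical names, and a real deployment would hand the Quarantined record to the dead-letter queue rather than return it.

```python
from __future__ import annotations

from dataclasses import dataclass
from decimal import Decimal, InvalidOperation
from typing import Callable

Payload = dict[str, object]
Stage = Callable[[Payload], Payload]


@dataclass(frozen=True)
class Quarantined:
    stage: str   # which stage rejected the payload
    reason: str  # message that would travel to the dead-letter queue


def run_pipeline(payload: Payload, stages: list[tuple[str, Stage]]) -> Payload | Quarantined:
    """Run stages in order; the first ValueError parks the payload."""
    current = dict(payload)  # work on a copy, never the caller's dict
    for name, stage in stages:
        try:
            current = stage(current)
        except ValueError as exc:
            return Quarantined(stage=name, reason=str(exc))
    return current


def classification_stage(p: Payload) -> Payload:
    # Placeholder: only the closed-vocabulary check is illustrated.
    if p.get("discipline") not in {"ARCH", "STR", "MEP", "CIV", "ELEC", "PLMB"}:
        raise ValueError(f"unknown discipline: {p.get('discipline')!r}")
    return p


def cost_stage(p: Payload) -> Payload:
    # Placeholder: parse the unit cost and refuse bad or negative values.
    try:
        cost = Decimal(str(p.get("unit_cost")))
    except InvalidOperation:
        raise ValueError(f"unit_cost is not a decimal: {p.get('unit_cost')!r}") from None
    if not cost.is_finite() or cost < 0:
        raise ValueError("unit_cost must be a finite, non-negative amount")
    return {**p, "unit_cost": cost}


STAGES: list[tuple[str, Stage]] = [
    ("classification", classification_stage),
    ("cost", cost_stage),
]
```

Because each stage signals failure the same way, the runner stays a few lines long and the quarantine record always names the stage that needs attention.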

Step-by-step implementation

Step 1 — Define the submittal schema contract

The schema is a versioned, typed contract. Every payload declares a schema_version at the root so a field addition never silently breaks a downstream consumer, and identity fields (project_uuid, submittal_number, created_at) are immutable once minted. Discipline and review status are controlled vocabularies expressed as Literal types, not free strings, so cross-discipline reporting and routing can aggregate without a fragile string comparison. The CSI MasterFormat section is regex-constrained to the XX XX XX pattern and the WBS element to PROJ-NNN-DIV-NN, so a malformed code is rejected at the boundary. Timestamps are timezone-aware per the ISO 8601 date and time standard; a naive timestamp is rejected because it would corrupt approval-SLA math across project sites in different zones. Crucially, descriptive metadata (what the product is) is kept separate from financial metadata (what the deviation costs), so a re-classification never disturbs a committed cost record.

from __future__ import annotations

import logging
from datetime import datetime, timezone
from decimal import Decimal
from typing import Literal, Optional
from uuid import UUID

from pydantic import BaseModel, Field, field_validator, model_validator

logger = logging.getLogger("submittal.ingest")

Discipline = Literal["ARCH", "STR", "MEP", "CIV", "ELEC", "PLMB"]
ReviewStatus = Literal[
    "submitted",
    "under_review",
    "approved",
    "approved_as_noted",
    "revise_and_resubmit",
    "rejected",
    "for_record",
]
SubmittalKind = Literal["shop_drawing", "product_data", "sample", "certification", "mockup"]

CSI_PATTERN = r"^\d{2}\s\d{2}\s\d{2}$"          # MasterFormat: XX XX XX
WBS_PATTERN = r"^PROJ-\d{3}-[A-Z]{3,4}-\d{2}$"  # e.g. PROJ-014-STR-03


class SubmittalDescriptive(BaseModel):
    """What the submittal *is* — never mixed with money."""
    kind: SubmittalKind
    manufacturer: str = Field(min_length=2, max_length=120)
    model_number: Optional[str] = Field(default=None, max_length=120)
    specified_basis: str = Field(min_length=2, max_length=240)  # product named in the spec
    proposed_product: str = Field(min_length=2, max_length=240)
    certifications: list[str] = Field(default_factory=list)


class SubmittalFinancial(BaseModel):
    """What the deviation *costs* — Decimal only, bound to a budget code."""
    budget_code: str = Field(pattern=r"^[A-Z]{2}\d{4}$")  # canonical, e.g. GL1001
    unit_cost: Decimal = Field(ge=0, decimal_places=2)
    quantity: Decimal = Field(gt=0)
    freight: Decimal = Field(default=Decimal("0.00"), ge=0, decimal_places=2)
    schedule_impact_days: int = Field(default=0, ge=0)


class SubmittalRevision(BaseModel):
    schema_version: Literal["1.0"] = "1.0"
    project_uuid: UUID
    submittal_number: str = Field(pattern=r"^\d{2}\s\d{2}\s\d{2}-\d{3}$")  # CSI-seq
    revision_id: str = Field(pattern=r"^R\d{2}$")                          # R00, R01...
    csi_section: str = Field(pattern=CSI_PATTERN)
    wbs_node: str = Field(pattern=WBS_PATTERN)
    discipline: Discipline
    status: ReviewStatus
    created_at: datetime
    descriptive: SubmittalDescriptive
    financial: SubmittalFinancial
    extraction_confidence: float = Field(ge=0.0, le=1.0)

    @field_validator("created_at")
    @classmethod
    def require_tz_aware(cls, v: datetime) -> datetime:
        if v.tzinfo is None:
            raise ValueError("created_at must be timezone-aware (ISO 8601)")
        return v.astimezone(timezone.utc)

    @model_validator(mode="after")
    def section_matches_number(self) -> "SubmittalRevision":
        # The submittal number is prefixed with its CSI section; they must agree.
        if not self.submittal_number.startswith(self.csi_section):
            raise ValueError("submittal_number CSI prefix does not match csi_section")
        return self

    @property
    def total_cost_impact(self) -> Decimal:
        f = self.financial
        return (f.unit_cost * f.quantity) + f.freight

Step 2 — Normalize classification deterministically

Field staff write the same discipline a dozen ways: "Electrical", "elec", "E". The CSI section arrives as "260500", "26.05.00", or "26 05 00". Normalization is a pure, idempotent transformation — given the same messy input it always yields the same canonical output — because the pipeline retries on broker redelivery and a non-deterministic clean would let one submittal commit under two different scopes. Map variants onto the controlled vocabulary first, then collapse the CSI section into the mandated XX XX XX spacing.

_DISCIPLINE_ALIASES = {
    "architectural": "ARCH", "arch": "ARCH", "a": "ARCH",
    "structural": "STR", "struct": "STR", "s": "STR",
    "mechanical": "MEP", "mech": "MEP", "hvac": "MEP", "m": "MEP",
    "civil": "CIV", "c": "CIV",
    "electrical": "ELEC", "elec": "ELEC", "e": "ELEC",
    "plumbing": "PLMB", "plumb": "PLMB", "p": "PLMB",
}


def normalize_discipline(raw: str) -> Discipline:
    key = raw.strip().lower()
    if key in _DISCIPLINE_ALIASES:
        return _DISCIPLINE_ALIASES[key]  # type: ignore[return-value]
    raise ValueError(f"unknown discipline: {raw!r}")


import re  # belongs with the other imports at the top of the module


def normalize_csi(raw: str) -> str:
    digits = re.sub(r"\D", "", raw)
    if len(digits) != 6:
        raise ValueError(f"CSI section needs exactly 6 digits, got {raw!r}")
    return f"{digits[0:2]} {digits[2:4]} {digits[4:6]}"  # XX XX XX

import re belongs at the top of the module; it is shown here beside its use for clarity.
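The same idea extends to the submittal number, whose CSI prefix has to agree with csi_section. The helper below is a hypothetical addition, not part of the page's pipeline: it reuses the six-digit rule from normalize_csi (repeated here so the sketch runs on its own) and assumes the sequence is whatever follows the last hyphen.

```python
import re


def normalize_csi(raw: str) -> str:
    """Collapse any CSI spelling to XX XX XX (same rule as in the text)."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) != 6:
        raise ValueError(f"CSI section needs exactly 6 digits, got {raw!r}")
    return f"{digits[0:2]} {digits[2:4]} {digits[4:6]}"


def normalize_submittal_number(raw: str) -> str:
    """Hypothetical helper: '03.30.00-14' -> '03 30 00-014'."""
    # Split on the LAST hyphen so hyphenated section spellings still work.
    section, sep, seq = raw.strip().rpartition("-")
    if not sep or not re.fullmatch(r"[0-9]{1,3}", seq):
        raise ValueError(f"expected <CSI section>-<sequence>, got {raw!r}")
    return f"{normalize_csi(section)}-{int(seq):03d}"
```

Running the helper on its own output returns the same string, so a broker retry that re-normalizes an already-clean number cannot shift it to a different key.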

Step 3 — Match the substitution and resolve scope by confidence

A submittal that proposes the exact product named in the specification is a clean approval. A submittal that proposes a substitute has to be matched against the specified basis and routed by how confident that match is. This is where the site-canonical confidence bands govern behavior: an exact or near-exact match (0.92 and above) auto-routes, a 0.75–0.92 score files the record but flags it for an engineer’s review, and below 0.75 the record is quarantined so no substitution is ever accepted against a guessed specification. The same band logic resolves the cleaned spec string to a canonical WBS node, mirroring the approach in RFI schema design.

from difflib import SequenceMatcher

AUTO_ROUTE = 0.92        # >= 0.92  -> auto-route
REVIEW_FLOOR = 0.75      # 0.75-0.92 -> human review; < 0.75 -> quarantine

RoutingState = Literal["auto_route", "human_review", "quarantine"]


def match_confidence(specified: str, proposed: str) -> float:
    a, b = specified.strip().lower(), proposed.strip().lower()
    return round(SequenceMatcher(None, a, b).ratio(), 4)


def classify_confidence(score: float) -> RoutingState:
    if score >= AUTO_ROUTE:
        return "auto_route"
    if score >= REVIEW_FLOOR:
        return "human_review"
    return "quarantine"


def resolve_substitution(rev: SubmittalRevision) -> tuple[float, RoutingState]:
    # Combine extraction confidence with product-match confidence; the weakest
    # signal dominates so a crisp OCR read of the wrong product still gets caught.
    match = match_confidence(rev.descriptive.specified_basis, rev.descriptive.proposed_product)
    combined = round(min(match, rev.extraction_confidence), 4)
    return combined, classify_confidence(combined)
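The band edges are where routing bugs tend to hide, so it can help to pin them down in a table. The sketch below restates the two thresholds and the min-based combination from the code above in a self-contained form; combined_state and BOUNDARIES are hypothetical names introduced for illustration.

```python
from difflib import SequenceMatcher

AUTO_ROUTE = 0.92
REVIEW_FLOOR = 0.75


def match_confidence(specified: str, proposed: str) -> float:
    # Case and surrounding whitespace are ignored before comparison.
    a, b = specified.strip().lower(), proposed.strip().lower()
    return round(SequenceMatcher(None, a, b).ratio(), 4)


def classify_confidence(score: float) -> str:
    if score >= AUTO_ROUTE:
        return "auto_route"
    if score >= REVIEW_FLOOR:
        return "human_review"
    return "quarantine"


def combined_state(match: float, extraction: float) -> str:
    """The weaker of the two signals decides the band."""
    return classify_confidence(round(min(match, extraction), 4))


# (score, expected state) pairs on either side of each threshold.
BOUNDARIES = [
    (0.92, "auto_route"),
    (0.9199, "human_review"),
    (0.75, "human_review"),
    (0.7499, "quarantine"),
]
```

Both thresholds are inclusive on their lower edge: exactly 0.92 auto-routes and exactly 0.75 goes to review. A perfect product match read at 0.80 extraction confidence still lands in human review, because the minimum of the two signals is what gets classified.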

Step 4 — Validate cost impact, then route and trigger change orders

The final stage assembles the routing decision. A revise_and_resubmit or rejected status never generates a change order regardless of cost. Only an approved or approved_as_noted revision whose total_cost_impact meets or exceeds the project's CHANGE_ORDER_THRESHOLD produces a draft change order, and even then only when the substitution cleared the auto-route or human-review band. SLA breaches on the review clock are not handled here; they are handed to the fallback alert router so escalation policy lives in one place.

CHANGE_ORDER_THRESHOLD = Decimal("5000.00")  # per-project configurable

RouteAction = Literal["file_for_record", "approval_queue", "draft_change_order", "quarantine"]


def route_submittal(rev: SubmittalRevision) -> RouteAction:
    score, state = resolve_substitution(rev)

    if state == "quarantine":
        logger.warning(
            "submittal.quarantine",
            extra={"submittal": rev.submittal_number, "rev": rev.revision_id, "score": score},
        )
        return "quarantine"

    if rev.status in ("revise_and_resubmit", "rejected"):
        return "approval_queue"  # back to the reviewer; never a change order

    if rev.status in ("approved", "approved_as_noted"):
        if rev.total_cost_impact >= CHANGE_ORDER_THRESHOLD:
            logger.info(
                "submittal.change_order",
                extra={
                    "submittal": rev.submittal_number,
                    "impact": str(rev.total_cost_impact),
                    "state": state,
                },
            )
            # Publish to the broker; never mutate the ledger inline.
            return "draft_change_order"
        return "file_for_record"

    return "approval_queue"  # submitted / under_review still awaiting a verdict

Routing returns an action rather than performing side effects so the function stays testable and idempotent; the broker publish and ledger write happen in a thin outer task that the queue can safely retry.
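The thin outer task can be sketched as a commit guarded by an idempotency key. The class below is a hypothetical in-memory stand-in: a production system would back the key with something durable, such as a unique constraint in the ledger database, and might widen the key with project_uuid if submittal numbers repeat across projects. Both points are assumptions, not something the page specifies.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeOrderOutbox:
    """In-memory stand-in for a durable store with a unique key."""

    _seen: set[tuple[str, str]] = field(default_factory=set)
    published: list[dict[str, str]] = field(default_factory=list)

    def commit_once(self, submittal_number: str, revision_id: str, action: str) -> bool:
        """Record the routing verdict; return False when it is a replay."""
        key = (submittal_number, revision_id)
        if key in self._seen:
            return False  # redelivered message: deliberately a no-op
        self._seen.add(key)
        if action == "draft_change_order":
            # Stand-in for the broker publish; nothing is sent anywhere.
            self.published.append(
                {"submittal": submittal_number, "rev": revision_id}
            )
        return True
```

A redelivered message reaches commit_once with the same key and is dropped, while a genuine resubmittal under a new revision_id commits normally.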

Schema and configuration reference

Field	Type / pattern	Rule	Why it matters
`submittal_number`	`XX XX XX-NNN`	CSI prefix must equal `csi_section`	Stops a package filing under the wrong spec division
`csi_section`	`^\d{2}\s\d{2}\s\d{2}$`	Normalized to `XX XX XX`	MasterFormat-canonical for downstream cost allocation
`wbs_node`	`PROJ-NNN-DIV-NN`	Must resolve in the master WBS map	Binds the submittal to a budgeted scope element
`revision_id`	`R\d{2}`	Increments on resubmit	Tracks lineage across the Revise/Resubmit loop
`discipline`	`Literal[ARCH,STR,MEP,CIV,ELEC,PLMB]`	Closed vocabulary	Rejects typos that create phantom buckets
`status`	`ReviewStatus` Literal	Drives change-order eligibility	Only approved states can trigger a change order
`unit_cost`/`freight`	`Decimal`, 2 dp, `ge=0`	No floats	Prevents penny drift in cost-at-completion
`quantity`	`Decimal`, `gt=0`	Positive only	A zero-quantity delta is a data error
`budget_code`	`^[A-Z]{2}\d{4}$`	Canonical key	Posts against a real account, not a vendor label
`extraction_confidence`	`float` 0.0–1.0	From OCR stage	Feeds the routing band decision

Routing constants are site-canonical and should appear identically wherever submittal logic runs: AUTO_ROUTE = 0.92, REVIEW_FLOOR = 0.75, and a per-project CHANGE_ORDER_THRESHOLD (default Decimal("5000.00")).
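One way to keep those constants identical everywhere is to hold them in a single frozen object and allow only the threshold to vary per project. RoutingConfig and load_config below are hypothetical names, and the overrides dictionary stands in for whatever project settings source is actually used.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class RoutingConfig:
    auto_route: float = 0.92
    review_floor: float = 0.75
    change_order_threshold: Decimal = Decimal("5000.00")

    def __post_init__(self) -> None:
        # Reject band layouts that would make a routing state unreachable.
        if not 0.0 <= self.review_floor < self.auto_route <= 1.0:
            raise ValueError("bands must satisfy 0 <= review_floor < auto_route <= 1")
        if self.change_order_threshold < 0:
            raise ValueError("change_order_threshold must not be negative")


def load_config(overrides: dict[str, str]) -> RoutingConfig:
    """Only the threshold is per-project; the bands stay site-canonical."""
    raw = overrides.get("change_order_threshold")
    if raw is None:
        return RoutingConfig()
    return RoutingConfig(change_order_threshold=Decimal(raw))
```

Freezing the dataclass means a worker cannot nudge a band at runtime, and parsing the threshold from a string keeps floats out of the cost path.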

Verification and testing

Treat normalization and routing as pure functions and assert against known inputs. The serialized contract is checked with model_dump_json so the wire format is part of the test surface, not an afterthought.

from datetime import datetime, timezone
from decimal import Decimal
from uuid import uuid4


def _revision(**overrides) -> SubmittalRevision:
    base = dict(
        project_uuid=uuid4(),
        submittal_number="03 30 00-014",
        revision_id="R01",
        csi_section="03 30 00",
        wbs_node="PROJ-014-STR-03",
        discipline="STR",
        status="approved",
        created_at=datetime(2026, 6, 27, 14, 30, tzinfo=timezone.utc),
        descriptive=SubmittalDescriptive(
            kind="product_data",
            manufacturer="Acme Concrete",
            specified_basis="4000 psi ready-mix, Type II cement",
            proposed_product="4000 psi ready-mix, Type II cement",
        ),
        financial=SubmittalFinancial(
            budget_code="GL1001", unit_cost=Decimal("120.00"), quantity=Decimal("80"),
        ),
        extraction_confidence=0.97,
    )
    base.update(overrides)
    return SubmittalRevision(**base)


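def test_csi_normalization_is_idempotent():
    assert normalize_csi("26.05.00") == "26 05 00"
    assert normalize_csi("260500") == normalize_csi("26 05 00")
    # Idempotence proper: cleaning an already-clean value changes nothing.
    assert normalize_csi(normalize_csi("26.05.00")) == "26 05 00"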


def test_exact_product_auto_routes_change_order():
    rev = _revision()  # impact 120 * 80 = 9600 >= 5000
    assert rev.total_cost_impact == Decimal("9600.00")
    assert route_submittal(rev) == "draft_change_order"


def test_low_confidence_substitution_quarantines():
    rev = _revision(extraction_confidence=0.61)
    assert route_submittal(rev) == "quarantine"


def test_rejected_never_drafts_change_order():
    rev = _revision(status="rejected")
    assert route_submittal(rev) == "approval_queue"


def test_serialized_contract_round_trips():
    rev = _revision()
    restored = SubmittalRevision.model_validate_json(rev.model_dump_json())
    assert restored.total_cost_impact == rev.total_cost_impact

Run the suite with pytest -q. For a quick manual smoke test, pipe a sample payload through python -c "import json,sys; from submittal import SubmittalRevision; print(SubmittalRevision.model_validate_json(sys.stdin.read()).model_dump_json(indent=2))" and confirm the CSI section and WBS node echo back in canonical form.

Troubleshooting

ValidationError on submittal_number for a valid-looking package. The number’s CSI prefix does not match csi_section — often because the section was hand-typed with periods (03.30.00) while the number used spaces. Normalize the CSI string before constructing the model so both fields agree, then re-validate.
European decimal formats corrupt unit_cost. A vendor quote of 1.234,56 is not a valid Decimal literal, so it fails validation outright, and an extractor that works around the failure by cutting the string at the comma stores 1.234 instead of 1234.56. Detect the locale at the extraction boundary and convert 1.234,56 → 1234.56 before it reaches the schema; never let a lossy cleanup silently truncate a cost.
Substitution confidence collapses on stamped drawings. OCR over an engineer’s wet stamp drags extraction_confidence below 0.75, so every revision quarantines. Raise the upstream OCR quality with the OCR preprocessing deskew and threshold step rather than lowering REVIEW_FLOOR, which would let real mismatches through.
Duplicate change orders from one submittal. A redelivered broker message re-ran route_submittal and published twice. Make the outer task idempotent: commit keyed on submittal_number plus revision_id, so a replay is a no-op even though the pure routing function happily returns the same verdict.
Bid-period submittal bursts overflow the worker pool. When dozens of packages land in an hour the synchronous validator times out. Move ingestion behind the async batching workflows queue so revisions are processed in bounded batches instead of contending for one connection.
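For the European-format case above, the conversion can be made explicit at the extraction boundary. The sketch below assumes the caller already knows which character is the decimal separator for the source document; parse_money is a hypothetical helper, and it rejects anything that does not fit the declared format instead of guessing.

```python
import re
from decimal import Decimal


def parse_money(raw: str, decimal_sep: str) -> Decimal:
    """Convert a locale-formatted amount to Decimal, or raise ValueError."""
    if decimal_sep not in {".", ","}:
        raise ValueError(f"unsupported decimal separator: {decimal_sep!r}")
    group_sep = "," if decimal_sep == "." else "."
    text = raw.strip()

    # Either properly grouped thousands (1.234.567) or plain digits (1234567),
    # followed by an optional one- or two-digit fraction.
    grouped = rf"\d{{1,3}}(?:{re.escape(group_sep)}\d{{3}})+"
    fraction = rf"(?:{re.escape(decimal_sep)}\d{{1,2}})?"
    if not re.fullmatch(rf"(?:{grouped}|\d+){fraction}", text):
        raise ValueError(f"amount {raw!r} does not match the declared format")

    # Drop group separators first, then switch to the canonical decimal point.
    canonical = text.replace(group_sep, "").replace(decimal_sep, ".")
    return Decimal(canonical)
```

With a comma declared as the decimal separator, 1.234,56 becomes Decimal("1234.56"). The same string with a period declared raises ValueError, so a mislabeled locale is caught rather than truncated.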

Frequently Asked Questions

Why split descriptive and financial metadata into separate models?

A submittal gets re-classified often — a reviewer corrects the discipline or the spec section after the fact — but its committed cost record must never move underneath the ledger. Keeping SubmittalDescriptive and SubmittalFinancial as distinct models means a classification fix touches only descriptive fields, preserving the audit trail on every Decimal that has already posted against a budget code.

How do the confidence bands apply to a submittal?

They govern substitution matching and scope resolution. An exact or near-exact match of 0.92 or above auto-routes the revision. A combined score of 0.75–0.92 files the record but flags it for an engineer’s review. Below 0.75, no substitution is trusted and the record is quarantined to the dead-letter queue rather than accepted against a guessed specification basis.

Why is status a Literal rather than a free string?

Change-order eligibility is decided directly from status, so an unrecognized value would be a routing hazard. A Literal of the seven canonical review states rejects typos at validation time, guaranteeing that only approved and approved_as_noted revisions can ever reach the change-order branch and that a revise_and_resubmit loops back to the reviewer instead.

Why model the Revise and Resubmit loop explicitly?

Submittals are iterative by nature; a package commonly cycles two or three times before approval. Modeling the loop with an incrementing revision_id (R00, R01, R02) preserves the full lineage as a directed acyclic chain, so an auditor can trace which revision was in force on a given date and the cost engine never double-counts a superseded delta.
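The audit question in that answer can be sketched as a lookup over the revision chain. The function below treats the most recently created revision at or before a given instant as the one in force, which is a simplifying assumption; a real ledger might also require that the revision had reached an approved status. revision_in_force is a hypothetical name.

```python
from datetime import datetime, timezone
from typing import Optional


def revision_in_force(
    revisions: list[tuple[str, datetime]], as_of: datetime
) -> Optional[str]:
    """Return the id of the latest revision created at or before as_of."""
    eligible = [(created, rev_id) for rev_id, created in revisions if created <= as_of]
    # max() orders by creation time first, then by revision id on ties.
    return max(eligible)[1] if eligible else None


# Illustrative chain for one package; timestamps are timezone-aware.
CHAIN = [
    ("R00", datetime(2026, 6, 1, 9, 0, tzinfo=timezone.utc)),
    ("R01", datetime(2026, 6, 15, 9, 0, tzinfo=timezone.utc)),
    ("R02", datetime(2026, 6, 27, 14, 30, tzinfo=timezone.utc)),
]
```

Summing cost impact only over the revision this lookup returns is one way to keep a superseded delta from being counted twice.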

Why must submittal ingestion be idempotent?

Brokers retry. When a message is redelivered after a transient fault, normalization, substitution matching, and routing must produce the identical verdict so the retry is a no-op. Pairing pure functions with a commit keyed on submittal_number plus revision_id is what prevents one approved package from generating two change order events.
