Batch processing Excel submittal logs with Pandas DataFrames

Construction submittal logs are rarely pristine. They arrive from disparate general contractors, architects, and subcontractors — frequently formatted with merged header cells, inconsistent date standards, and ad-hoc status labels typed by hand at 4 p.m. on a Friday. This page covers exactly one slice of the pipeline: how to ingest a directory of .xlsx submittal logs with Pandas and emit a single, strictly typed, queryable table — without silently dropping rows or corrupting the cost and schedule fields that downstream systems depend on. Getting this transform deterministic is what lets the rest of an automated document ingestion and parsing workflow trust the data it receives before routing records into Procore, Autodesk Build, or a custom ERP. This work runs inside the broader async batching workflows that decouple file receipt from transformation, so the batch transform here is the unit a worker executes once a burst of logs lands.

Key rules and specification

A submittal-log batch transform is governed by a handful of non-negotiable constraints. Encode each one as a typed field or an explicit branch — never as an implicit assumption about “clean” input.

Rule	Specification
Spec section format	CSI MasterFormat six-digit `XX XX XX` (e.g. `03 30 00`), normalized from `03-30-00`, `3.30.00`, or `033000`
Status domain	Closed enum: `Approved`, `Approved as Noted`, `Revise & Resubmit`, `Rejected`, `Pending Review`
Dates	Parsed permissively into `datetime64[ns]` and serialized as ISO 8601; unparseable values become `NaT`, never a guess
Money/quantity	Never float — parse to `Decimal` to avoid drift in cumulative rollups
Routing confidence	`>= 0.92` auto-route, `>= 0.75` and `< 0.92` human-review, `< 0.75` quarantine (the site-canonical thresholds)
Failure handling	A bad row is quarantined with a reason, never dropped; a bad file produces an audit record, never an exception that kills the batch

The status domain and spec-section pattern belong in the validation rules that gate the record before handoff, and the spec section itself is the join key for WBS mapping once the row is clean.

Header flattening and schema mapping

Submittal logs frequently use multi-row headers or merged cells for visual grouping. Tell read_excel how many header rows exist, then flatten the resulting MultiIndex before matching any column. Case-insensitive regex matching maps each raw alias onto a canonical field name so that “Submittal No.”, “SUBMITTAL #”, and “Submittal Number” all collapse to submittal_id.

import re
import logging
from pathlib import Path
from typing import Dict

import pandas as pd

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
logger = logging.getLogger(__name__)

# Raw alias (regex) -> canonical field. [\s_]* also accepts the "_" joiner that
# header flattening inserts, so "Submittal_No." still matches.
STANDARD_SCHEMA: Dict[str, str] = {
    r"submittal[\s_]*(no|#|number)": "submittal_id",
    r"spec[\s_]*section": "spec_section",
    r"description|title": "description",
    r"status|disposition": "status",
    r"due[\s_]*date|required": "due_date",
    r"assigned|ball[\s_]*in[\s_]*court": "assigned_to",
    r"rev(ision)?": "revision",
}

def load_and_flatten_log(filepath: str) -> pd.DataFrame:
    """Read an Excel submittal log, flatten multi-row headers, map to the canonical schema."""
    path = Path(filepath)
    if not path.exists():
        raise FileNotFoundError(f"Submittal log not found: {filepath}")

    try:
        # header=[0, 1] handles the common two-row merged header; tune per template family.
        df = pd.read_excel(filepath, header=[0, 1], engine="openpyxl")
    except Exception as exc:
        raise RuntimeError(f"Failed to parse workbook {filepath}: {exc}") from exc

    if df.empty:
        raise ValueError(f"Workbook contains no parseable rows: {filepath}")

    # Flatten the MultiIndex: ("Submittal", "No.") -> "Submittal_No."
    df.columns = ["_".join(str(c) for c in col).strip("_") for col in df.columns]

    col_mapping: Dict[str, str] = {}
    for raw_col in df.columns:
        for pattern, canonical in STANDARD_SCHEMA.items():
            if re.search(pattern, raw_col, re.IGNORECASE):
                col_mapping[raw_col] = canonical
                break
    df = df.rename(columns=col_mapping)

    required = {"submittal_id", "spec_section", "status"}
    missing = required - set(df.columns)
    if missing:
        logger.warning("Missing critical columns in %s: %s", filepath, missing)

    return df
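
As a quick check of the alias matching described above, here is a minimal sketch that runs a handful of raw headers through a trimmed alias table. The `ALIASES` table and the `canonical_name` helper are hypothetical names introduced for illustration, and the patterns assume that flattened headers may contain an underscore between words.

```python
import re
from typing import Optional

# Hypothetical, trimmed alias table. [\s_]* tolerates both whitespace and the
# "_" joiner that appears after a two-row header is flattened.
ALIASES = {
    r"submittal[\s_]*(no|#|number)": "submittal_id",
    r"spec[\s_]*section": "spec_section",
    r"status|disposition": "status",
}

def canonical_name(raw_col: str) -> Optional[str]:
    """Return the canonical field for a raw header, or None if nothing matches."""
    for pattern, canonical in ALIASES.items():
        if re.search(pattern, raw_col, re.IGNORECASE):
            return canonical
    return None

headers = [
    "Submittal No.", "SUBMITTAL #", "Submittal Number",
    "Submittal_No.", "Spec Section", "Remarks",
]
mapped = {h: canonical_name(h) for h in headers}
for raw, canonical in mapped.items():
    print(f"{raw!r} -> {canonical}")
```

Headers that match nothing come back as `None`, which makes unmapped columns easy to log per template family instead of discovering them downstream.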

Type coercion, status normalization, and confidence scoring

Raw extraction yields strings. Construction logs mix MM/DD/YYYY, YYYY-MM-DD, and DD-Mon-YY dates and abbreviate statuses inconsistently (Appr., A/N, R&R). Coerce dates permissively to NaT rather than guessing a value, normalize statuses against the closed domain, reformat the spec section to the canonical XX XX XX, and attach a per-row confidence score that drives routing. See the pandas datetime parsing documentation for format tuning on locale-specific strings.

from typing import Literal

StatusLiteral = Literal[
    "Approved", "Approved as Noted", "Revise & Resubmit", "Rejected", "Pending Review"
]

STATUS_MAP: Dict[str, str] = {
    "approved": "Approved", "appr": "Approved", "a": "Approved",
    "approved as noted": "Approved as Noted", "an": "Approved as Noted", "a/n": "Approved as Noted",
    "revise and resubmit": "Revise & Resubmit", "r&r": "Revise & Resubmit",
    "rejected": "Rejected", "r": "Rejected",
    "pending": "Pending Review", "pending review": "Pending Review", "p": "Pending Review",
}
VALID_STATUSES = set(STATUS_MAP.values())

def normalize_types(df: pd.DataFrame) -> pd.DataFrame:
    """Coerce types, normalize the status enum and CSI section, and score each row 0.0-1.0."""
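    if "due_date" in df.columns:
        # errors="coerce" -> NaT, never a fabricated date. pandas >= 2.0 infers one format
        # from the first value, so a column that mixes styles needs format="mixed".
        df["due_date"] = pd.to_datetime(df["due_date"], errors="coerce")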

    if "status" in df.columns:
        # Strip whitespace and trailing periods so "Appr." matches the "appr" key.
        raw = df["status"].astype(str).str.strip().str.rstrip(".").str.lower()
        df["status"] = raw.map(STATUS_MAP).fillna(raw.str.title())

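    if "spec_section" in df.columns:
        # "03-30-00" / "3.30.00" / "033000" -> canonical "03 30 00"
        def _canon(value: str) -> str:
            groups = re.findall(r"\d+", value)
            if len(groups) == 3 and all(len(g) <= 2 for g in groups):
                return " ".join(g.zfill(2) for g in groups)  # restores "3" -> "03"
            digits = "".join(groups)
            return f"{digits[0:2]} {digits[2:4]} {digits[4:6]}" if len(digits) == 6 else digits
        df["spec_section"] = df["spec_section"].astype(str).map(_canon)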

    # Confidence: start at 1.0, dock points for each field that failed to resolve cleanly.
    # A column that is absent altogether is docked the same penalty on every row.
    score = pd.Series(1.0, index=df.index)
    score -= df["submittal_id"].isna().astype(float) * 0.40 if "submittal_id" in df else 0.40
    score -= (~df["status"].isin(VALID_STATUSES)).astype(float) * 0.30 if "status" in df else 0.30
    score -= df["due_date"].isna().astype(float) * 0.15 if "due_date" in df else 0.15
    score -= (~df["spec_section"].str.fullmatch(r"\d{2} \d{2} \d{2}", na=False)).astype(float) * 0.15 \
        if "spec_section" in df else 0.15
    df["parse_confidence"] = score.clip(lower=0.0)

    # Map confidence to a routing state using the canonical thresholds.
    # right=False makes each bin left-closed, so 0.92 auto-routes and 0.75 goes to review.
    df["routing_state"] = pd.cut(
        df["parse_confidence"],
        bins=[0.0, 0.75, 0.92, 1.01],
        right=False,
        labels=["quarantine", "human_review", "auto_route"],
    ).astype(str)

    return df

Batch execution and quarantine

Processing one file per call works, but the batch wrapper is what makes the transform operable. Scan the input directory, run the two transforms per file, and split the result by routing state. A malformed file never raises into the loop; it produces an audit row so the batch finishes and the failure is reviewable. Auto-routable and quarantined records are written to separate Parquet files for downstream consumption, while the review frame is returned to the caller for the human-review queue.

from typing import Dict, List

def process_submittal_batch(input_dir: Path, output_dir: Path) -> Dict[str, pd.DataFrame]:
    """Process every .xlsx in input_dir; return {'routable', 'review', 'quarantined'} frames."""
    routable: List[pd.DataFrame] = []
    review: List[pd.DataFrame] = []
    quarantined: List[pd.DataFrame] = []

    input_dir.mkdir(parents=True, exist_ok=True)
    output_dir.mkdir(parents=True, exist_ok=True)

    excel_files = sorted(input_dir.glob("*.xlsx"))
    if not excel_files:
        logger.info("No Excel files found in %s", input_dir)
        return {"routable": pd.DataFrame(), "review": pd.DataFrame(), "quarantined": pd.DataFrame()}

    for file_path in excel_files:
        try:
            logger.info("Processing %s", file_path.name)
            df = normalize_types(load_and_flatten_log(str(file_path)))
            df["source_file"] = file_path.name
            routable.append(df[df["routing_state"] == "auto_route"])
            review.append(df[df["routing_state"] == "human_review"])
            quarantined.append(df[df["routing_state"] == "quarantine"])
        except Exception as exc:
            # File-level failure becomes an audit record, not a dead batch.
            logger.error("Batch failure for %s: %s", file_path.name, exc)
            quarantined.append(pd.DataFrame([{
                "source_file": file_path.name,
                "error": str(exc),
                "routing_state": "quarantine",
                "parse_confidence": 0.0,
            }]))

    def _concat(frames: List[pd.DataFrame]) -> pd.DataFrame:
        return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

    result = {"routable": _concat(routable), "review": _concat(review), "quarantined": _concat(quarantined)}

    if not result["routable"].empty:
        result["routable"].to_parquet(output_dir / "submittals_routable.parquet", index=False)
    if not result["quarantined"].empty:
        result["quarantined"].to_parquet(output_dir / "submittals_quarantine.parquet", index=False)

    logger.info(
        "Batch complete. Routable=%d Review=%d Quarantined=%d",
        len(result["routable"]), len(result["review"]), len(result["quarantined"]),
    )
    return result

Validation before handoff

Auto-routable rows still get a final, hard schema gate before they touch a construction-management API — a payload rejected at the gateway is far more expensive to debug than an assertion failing here. Validate the closed status domain, the canonical spec-section pattern, and the date dtype with Pydantic v2 so the contract is explicit and the error messages name the offending field.

from decimal import Decimal
from datetime import date
from typing import Optional

from pydantic import BaseModel, Field, field_validator

SPEC_RE = r"^\d{2} \d{2} \d{2}$"  # CSI MasterFormat XX XX XX

class SubmittalRecord(BaseModel):
    submittal_id: str = Field(min_length=1, max_length=32)
    spec_section: str = Field(pattern=SPEC_RE)
    status: StatusLiteral
    due_date: Optional[date] = None
    revision: Optional[int] = Field(default=None, ge=0)
    parse_confidence: Decimal = Field(ge=0, le=1)

    @field_validator("spec_section")
    @classmethod
    def normalize_section(cls, v: str) -> str:
        digits = "".join(ch for ch in v if ch.isdigit())
        if len(digits) != 6:
            raise ValueError(f"spec_section must be 6 CSI digits, got {v!r}")
        return f"{digits[0:2]} {digits[2:4]} {digits[4:6]}"

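def to_validated_records(df: pd.DataFrame) -> list[dict]:
    """Validate auto-routable rows; raises on the first contract breach so nothing silently leaks."""
    # NaN and NaT are not None to Pydantic; map them to None so Optional fields validate.
    clean = df.astype(object).where(df.notna(), None)
    return [SubmittalRecord(**row).model_dump(mode="json")
            for row in clean.to_dict(orient="records")]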

Common mistakes and gotchas

Letting Pandas infer the date format per cell. to_datetime without errors="coerce" will raise on the first bad string and kill the whole frame; worse, mixed dayfirst logs can swap day and month silently. A due_date of 04 06 2026 that flips to June instead of April misroutes the entire approval clock. Coerce to NaT, dock confidence, and quarantine — never guess.
Reading money as float. A cost_impact column parsed as float64 accumulates binary-rounding drift the moment you groupby(...).sum() across hundreds of submittals, and the rolled-up total will not reconcile against the ledger. Parse currency and quantities to Decimal, the same discipline the budget cost-code standardization work relies on.
Dropping unmapped rows instead of quarantining them. df.dropna() after schema mapping feels tidy, but every dropped row is a submittal that vanishes from the project’s compliance record with no audit trail. Tag rows with a routing_state and persist the quarantine set; a human reconciles it, the batch never deletes it.
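
The third point can be sketched as a tagging step that runs where a `dropna()` would otherwise go. This is a minimal illustration, not the pipeline's actual implementation: the `tag_quarantine_reasons` helper and the `quarantine_reason` column are hypothetical names, and the sample rows are made up.

```python
import pandas as pd

REQUIRED = ("submittal_id", "spec_section", "status")

def tag_quarantine_reasons(df: pd.DataFrame) -> pd.DataFrame:
    """Record why a row is incomplete instead of dropping it."""
    out = df.copy()
    reasons = pd.Series("", index=out.index)
    for col in REQUIRED:
        if col in out.columns:
            missing = out[col].isna()
        else:
            # A column that is absent from the file counts as missing on every row.
            missing = pd.Series(True, index=out.index)
        reasons = reasons.where(~missing, reasons + f"missing {col}; ")
    out["quarantine_reason"] = reasons.str.rstrip("; ")
    return out

# Made-up sample rows for illustration only.
frame = pd.DataFrame({
    "submittal_id": ["S-001", None, "S-003"],
    "spec_section": ["03 30 00", "05 12 00", None],
    "status": ["Approved", "Rejected", None],
})
tagged = tag_quarantine_reasons(frame)
print(tagged[["submittal_id", "quarantine_reason"]])
```

Every input row survives, and the reason string gives the reviewer something concrete to reconcile against the source workbook.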

Integration pointer

This transform sits one layer below the queue. A worker in the async batching workflows tier pulls a burst of logs from the broker, calls process_submittal_batch, and forwards only the routable frame downstream while the review and quarantined frames branch off. Files that turn out to be scanned images rather than real workbooks should be diverted to OCR preprocessing for construction docs before they ever reach this code. From here, validated records feed the same join keys used in PDF/Excel sync pipelines and the RFI-side equivalent, validating extracted RFI fields against custom JSON schemas.
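
One way to divert scanned images before they reach this code is to sniff the first few bytes of each file. The sketch below is an assumption about how such a pre-check might look, not part of the pipeline described here; `sniff_kind` and `sniff_file` are hypothetical helpers. It relies on the fact that an .xlsx workbook is a ZIP container, so a ZIP signature is necessary but not sufficient proof of a valid workbook.

```python
from pathlib import Path

# .xlsx is a ZIP container, so a real workbook starts with the ZIP local-file signature.
ZIP_MAGIC = b"PK\x03\x04"

# Leading bytes of common scan formats that sometimes arrive renamed to .xlsx.
SCAN_MAGIC = {
    b"%PDF": "pdf",
    b"\x89PNG": "png",
    b"\xff\xd8\xff": "jpeg",
    b"II*\x00": "tiff",
    b"MM\x00*": "tiff",
}

def sniff_kind(head: bytes) -> str:
    """Classify a file from its leading bytes: 'workbook', a scan format, or 'unknown'."""
    if head.startswith(ZIP_MAGIC):
        return "workbook"
    for magic, kind in SCAN_MAGIC.items():
        if head.startswith(magic):
            return kind
    return "unknown"

def sniff_file(path: Path) -> str:
    """Read only the first bytes of the file, so large uploads stay cheap to check."""
    with path.open("rb") as fh:
        return sniff_kind(fh.read(8))

print(sniff_kind(b"PK\x03\x04\x14\x00\x06\x00"))  # workbook
print(sniff_kind(b"%PDF-1.7"))                     # pdf
```

Anything that is not classified as a workbook can be sent to the OCR tier or quarantined with the detected kind as the reason.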

Frequently asked questions

How do I handle submittal logs where the header row count varies between files?

Detect it instead of hard-coding header=[0, 1]. Read the first few rows with header=None, find the row index whose cells best match your alias regexes, then re-read with that as the header. Group templates into families and store the detected header depth per source so repeat files from the same general contractor skip the detection step.
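
A minimal sketch of that detection step, under the assumption that the preview rows come from something like `pd.read_excel(path, header=None, nrows=10).values.tolist()`. The `detect_header_row` helper, the `ALIAS_PATTERNS` list, and the sample preview are hypothetical and only illustrate the scoring idea.

```python
import re
from typing import Sequence

# Hypothetical subset of the alias regexes used for column mapping.
ALIAS_PATTERNS = [
    r"submittal[\s_]*(no|#|number)",
    r"spec[\s_]*section",
    r"status|disposition",
    r"due[\s_]*date",
]

def detect_header_row(rows: Sequence[Sequence[object]], max_scan: int = 10) -> int:
    """Return the index of the row whose cells match the most alias patterns."""
    best_idx, best_hits = 0, -1
    for idx, row in enumerate(rows[:max_scan]):
        cells = [str(cell) for cell in row if cell is not None]
        hits = sum(
            1 for pattern in ALIAS_PATTERNS
            if any(re.search(pattern, cell, re.IGNORECASE) for cell in cells)
        )
        if hits > best_hits:  # strict ">" keeps the earliest row on ties
            best_idx, best_hits = idx, hits
    return best_idx

# Made-up preview: two title rows above the real header.
preview = [
    ["Project: Example Tower", None, None, None],
    ["Submittal Log", None, None, None],
    ["Submittal No.", "Spec Section", "Status", "Due Date"],
    ["S-001", "03 30 00", "Approved", "2026-04-06"],
]
header_idx = detect_header_row(preview)
print(header_idx)
```

The detected index can then be passed back as the `header` argument on the second read and cached per template family.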

Why use Decimal instead of float for cost and quantity columns?

Floating-point math cannot represent most decimal fractions exactly, so summing many line items introduces drift that breaks reconciliation against the cost ledger. Decimal preserves exact base-10 values, which is mandatory anywhere construction money is aggregated. Convert at parse time with Decimal(str(value)), never Decimal(float_value).
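
The sketch below illustrates both halves of that answer: the drift itself, and a parse-time conversion. The `parse_money` helper is a hypothetical example, and its handling of `$`, thousands separators, and parenthesized negatives is an assumption about how cost cells are typed; adjust it to the formats a given log family actually uses.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional

def parse_money(value: object) -> Optional[Decimal]:
    """Parse a currency cell to Decimal; return None when it cannot be parsed."""
    if value is None:
        return None
    # Going through str() avoids importing a float's binary rounding error.
    text = str(value).strip().replace("$", "").replace(",", "")
    # Accounting-style negatives: "(100.05)" -> "-100.05"
    if text.startswith("(") and text.endswith(")"):
        text = "-" + text[1:-1]
    try:
        amount = Decimal(text)
    except InvalidOperation:
        return None
    # Reject NaN and Infinity, which Decimal would otherwise accept.
    return amount if amount.is_finite() else None

line_items = ["$1,250.10", "0.20", "(100.05)", "n/a"]
parsed = [parse_money(v) for v in line_items]
total = sum((p for p in parsed if p is not None), Decimal("0"))

float_total = 0.1 + 0.2                          # binary floating point
decimal_total = Decimal("0.1") + Decimal("0.2")  # exact base-10

print(total)
print(float_total == 0.3, decimal_total == Decimal("0.3"))
```

Cells that fail to parse come back as `None` so the caller can dock confidence or quarantine the row instead of summing a guessed value.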

What do the 0.92 and 0.75 confidence thresholds map to?

They are the site-canonical routing bands. A row scoring 0.92 or higher is clean enough to auto-route to the downstream API; 0.75 to 0.92 goes to a human-review queue where an estimator confirms ambiguous fields; below 0.75 the row is quarantined for correction. Tuning the per-field penalties in normalize_types shifts how aggressively rows fall into each band.
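
The bands can be written out as a plain function to make the boundary behavior explicit. This is a sketch that mirrors the thresholds and per-field penalties quoted on this page; the `confidence` and `routing_state` helper names are hypothetical.

```python
from typing import Iterable

# Thresholds quoted in the text; penalties mirror those used in normalize_types.
AUTO_ROUTE_MIN = 0.92
REVIEW_MIN = 0.75
PENALTIES = {"submittal_id": 0.40, "status": 0.30, "due_date": 0.15, "spec_section": 0.15}

def confidence(failed_fields: Iterable[str]) -> float:
    """Start at 1.0 and subtract the penalty for each field that failed to resolve."""
    score = 1.0 - sum(PENALTIES[name] for name in failed_fields)
    return max(score, 0.0)

def routing_state(score: float) -> str:
    """Map a confidence score to a routing band; both lower bounds are inclusive."""
    if score >= AUTO_ROUTE_MIN:
        return "auto_route"
    if score >= REVIEW_MIN:
        return "human_review"
    return "quarantine"

examples = {
    "clean row": routing_state(confidence([])),
    "unparseable due date": routing_state(confidence(["due_date"])),
    "unknown status": routing_state(confidence(["status"])),
}
for label, state in examples.items():
    print(f"{label}: {state}")
```

With these particular penalties, any single failed field already drops a row below 0.92, so only fully clean rows auto-route; softer penalties would be needed to let minor defects through.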

Should I use Parquet or CSV for the batch output?

Parquet. It preserves the datetime64[ns] and Decimal dtypes you worked to establish, compresses far better on wide submittal tables, and supports predicate pushdown when a downstream job queries only auto_route rows. CSV would silently flatten every typed column back to strings, undoing the normalization.
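
The CSV half of that claim is easy to demonstrate without touching disk. The sketch below round-trips a small, made-up frame through an in-memory CSV; the `spec_digits` column is a hypothetical stand-in for any digit string with a leading zero.

```python
import io

import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Made-up typed frame: digit strings with leading zeros and a real datetime column.
typed = pd.DataFrame({
    "spec_digits": ["033000", "051200"],
    "due_date": pd.to_datetime(["2026-04-06", "2026-05-01"]),
})

buffer = io.StringIO()
typed.to_csv(buffer, index=False)
buffer.seek(0)
round_tripped = pd.read_csv(buffer)  # default settings: no dtype hints, no date parsing

print(typed.dtypes)
print(round_tripped.dtypes)
print(round_tripped["spec_digits"].tolist())
```

After the round trip the date column is no longer a datetime dtype, and the digit strings are re-inferred as integers, which silently drops their leading zeros. Explicit `dtype` and `parse_dates` arguments can repair this on read, but every consumer then has to repeat that schema by hand.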

Async Batching Workflows — the queue tier that schedules this batch transform
Parsing unstructured PDF change orders with Python and pypdf — the PDF-side counterpart to this Excel ingestion
Validating extracted RFI fields against custom JSON schemas — the schema gate pattern applied to RFIs
Implementing retry logic for failed API document pulls — error-handling discipline for the routing stage
How to map CSI MasterFormat to custom WBS codes in Python — what the normalized spec section joins against
