Skip to content

Batch processing Excel submittal logs with Pandas DataFrames

Construction submittal logs are rarely pristine. They originate from disparate general contractors, architects, and subcontractors, frequently formatted with merged cells, inconsistent date standards, and ad-hoc status labels. Automating the ingestion of these logs requires a deterministic parsing pipeline that normalizes raw Excel data into a queryable schema before downstream routing or compliance checks. This process forms the foundation of Automated Document Ingestion & Parsing systems, where raw spreadsheets are transformed into structured records ready for integration with Procore, Autodesk Build, or custom ERP platforms.

1. Header Flattening & Schema Mapping

Excel submittal logs frequently use multi-row headers or merged cells for visual grouping. Pandas read_excel must be configured to skip irrelevant title rows and flatten hierarchical indices before schema mapping. A robust approach maps raw column aliases to a standardized construction schema aligned with CSI MasterFormat and AIA G712 tracking conventions.

The following function reads the workbook, flattens multi-index columns, and applies case-insensitive regex matching to standardize field names. It includes explicit error handling for missing files, empty workbooks, and schema mismatches.

import pandas as pd
import re
import logging
from pathlib import Path
from typing import Dict, List

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
logger = logging.getLogger(__name__)

STANDARD_SCHEMA: Dict[str, str] = {
    'submittal no': 'submittal_id',
    'spec section': 'spec_section',
    'description': 'description',
    'status': 'status',
    'due date': 'due_date',
    'assigned to': 'assigned_to',
    'revision': 'revision'
}

def load_and_flatten_log(filepath: str) -> pd.DataFrame:
    """Reads an Excel submittal log, flattens multi-headers, and maps to standard schema."""
    path = Path(filepath)
    if not path.exists():
        raise FileNotFoundError(f"Submittal log not found: {filepath}")

    try:
        # Read with multi-row headers; adjust header=[0, 1] based on actual log structure
        df = pd.read_excel(filepath, header=[0, 1], engine='openpyxl')
    except Exception as e:
        raise RuntimeError(f"Failed to parse Excel workbook: {e}") from e

    if df.empty:
        raise ValueError(f"Workbook contains no parseable data: {filepath}")

    # Flatten hierarchical columns
    df.columns = ['_'.join(col).strip('_') for col in df.columns]

    # Regex-based schema mapping
    col_mapping: Dict[str, str] = {}
    for raw_col in df.columns:
        for key, std_key in STANDARD_SCHEMA.items():
            if re.search(key, raw_col, re.IGNORECASE):
                col_mapping[raw_col] = std_key
                break

    df = df.rename(columns=col_mapping)

    # Validate critical fields exist
    required = {'submittal_id', 'spec_section', 'status'}
    missing = required - set(df.columns)
    if missing:
        logger.warning(f"Missing critical columns in {filepath}: {missing}")

    return df

2. Type Coercion & Status Normalization

Construction logs suffer from inconsistent date formats and status variations. Pandas type coercion must handle MM/DD/YYYY, YYYY-MM-DD, and DD-Mon-YY simultaneously, while mapping abbreviated statuses to full AIA-compliant strings. The normalization function below enforces strict type casting, standardizes CSI section formatting, and quarantines unparseable values to prevent silent data corruption.

Refer to the pandas datetime parsing documentation for advanced format parameter configurations when dealing with locale-specific date strings.

STATUS_MAP: Dict[str, str] = {
    'approved': 'Approved', 'appr': 'Approved', 'a': 'Approved',
    'approved as noted': 'Approved as Noted', 'an': 'Approved as Noted',
    'revise and resubmit': 'Revise & Resubmit', 'r&r': 'Revise & Resubmit',
    'rejected': 'Rejected', 'r': 'Rejected',
    'pending': 'Pending Review', 'p': 'Pending Review'
}

def normalize_types(df: pd.DataFrame) -> pd.DataFrame:
    """Enforces strict typing, normalizes statuses, and flags malformed records."""
    # Date coercion with fallback to NaT for unparseable values
    if 'due_date' in df.columns:
        df['due_date'] = pd.to_datetime(df['due_date'], dayfirst=False, errors='coerce')

    # Status normalization
    if 'status' in df.columns:
        df['status'] = df['status'].astype(str).str.strip().str.lower()
        df['status'] = df['status'].map(STATUS_MAP).fillna(df['status'].str.title())

    # CSI Section standardization (e.g., '03 30 00' or '3.30.00' -> '03-30-00')
    if 'spec_section' in df.columns:
        df['spec_section'] = (
            df['spec_section']
            .astype(str)
            .str.replace(r'[^0-9]', ' ', regex=True)
            .str.strip()
            .apply(lambda x: '-'.join(x.split()) if len(x.split()) == 3 else x)
        )

    # Quarantine rows with critical missing data
    critical_null_mask = df[['submittal_id', 'status']].isna().any(axis=1)
    if critical_null_mask.any():
        logger.warning(f"Quarantining {critical_null_mask.sum()} rows with missing critical fields.")
        df['_parse_status'] = 'quarantine'
        df.loc[~critical_null_mask, '_parse_status'] = 'valid'
    else:
        df['_parse_status'] = 'valid'

    return df

3. Batch Execution & Error Quarantine

Processing individual files sequentially is inefficient for large project portfolios. Implementing a directory scanner with structured error logging enables scalable ingestion. This pattern integrates directly into Async Batching Workflows by decoupling file discovery from transformation logic.

The following pipeline aggregates valid records into a unified DataFrame while isolating parsing failures for manual review.

def process_submittal_batch(input_dir: Path, output_dir: Path) -> Dict[str, pd.DataFrame]:
    """Processes all .xlsx files in a directory, returning valid and quarantined DataFrames."""
    valid_records: List[pd.DataFrame] = []
    quarantined_records: List[pd.DataFrame] = []

    input_dir.mkdir(parents=True, exist_ok=True)
    output_dir.mkdir(parents=True, exist_ok=True)

    excel_files = list(input_dir.glob('*.xlsx'))
    if not excel_files:
        logger.info("No Excel files found in target directory.")
        return {'valid': pd.DataFrame(), 'quarantined': pd.DataFrame()}

    for file_path in excel_files:
        try:
            logger.info(f"Processing: {file_path.name}")
            df = load_and_flatten_log(str(file_path))
            df = normalize_types(df)

            # Split based on parse status
            valid_mask = df['_parse_status'] == 'valid'
            valid_records.append(df[valid_mask].drop(columns=['_parse_status']))
            quarantined_records.append(df[~valid_mask])

        except Exception as e:
            logger.error(f"Batch failure for {file_path.name}: {e}")
            # Create a minimal failure record for audit trails
            failure_df = pd.DataFrame([{'source_file': file_path.name, 'error': str(e), '_parse_status': 'quarantine'}])
            quarantined_records.append(failure_df)

    # Concatenate results
    valid_df = pd.concat(valid_records, ignore_index=True) if valid_records else pd.DataFrame()
    quarantine_df = pd.concat(quarantined_records, ignore_index=True) if quarantined_records else pd.DataFrame()

    # Export to Parquet for downstream consumption
    if not valid_df.empty:
        valid_df.to_parquet(output_dir / 'submittals_clean.parquet', index=False)
    if not quarantine_df.empty:
        quarantine_df.to_parquet(output_dir / 'submittals_quarantine.parquet', index=False)

    logger.info(f"Batch complete. Valid: {len(valid_df)}, Quarantined: {len(quarantine_df)}")
    return {'valid': valid_df, 'quarantined': quarantine_df}

4. Validation & Downstream Routing

Before routing normalized logs to construction management APIs, enforce schema validation to prevent payload rejection. Use pydantic or pandera for strict row-level validation, or implement lightweight pandas assertions:

def validate_schema(df: pd.DataFrame) -> None:
    """Asserts critical constraints before API routing."""
    assert df['submittal_id'].notna().all(), "Null submittal IDs detected post-normalization."
    assert df['status'].isin(['Approved', 'Approved as Noted', 'Revise & Resubmit', 'Rejected', 'Pending Review']).all(), \
           "Non-compliant status values detected."
    assert df['due_date'].dtype == 'datetime64[ns]', "Due dates failed datetime coercion."

For field extraction techniques involving scanned PDFs or image-based logs, integrate OCR preprocessing before the Pandas pipeline. Consult the OpenPyXL documentation for advanced cell styling, comment extraction, and protected sheet handling when dealing with architect-issued templates.