Batch processing Excel submittal logs with Pandas DataFrames
Construction submittal logs are rarely pristine. They originate from disparate general contractors, architects, and subcontractors, frequently formatted with merged cells, inconsistent date standards, and ad-hoc status labels. Automating the ingestion of these logs requires a deterministic parsing pipeline that normalizes raw Excel data into a queryable schema before downstream routing or compliance checks. This process forms the foundation of Automated Document Ingestion & Parsing systems, where raw spreadsheets are transformed into structured records ready for integration with Procore, Autodesk Build, or custom ERP platforms.
1. Header Flattening & Schema Mapping
Excel submittal logs frequently use multi-row headers or merged cells for visual grouping. Pandas read_excel must be configured to skip irrelevant title rows and flatten hierarchical indices before schema mapping. A robust approach maps raw column aliases to a standardized construction schema aligned with CSI MasterFormat and AIA G712 tracking conventions.
The following function reads the workbook, flattens multi-index columns, and applies case-insensitive regex matching to standardize field names. It includes explicit error handling for missing files, empty workbooks, and schema mismatches.
import pandas as pd
import re
import logging
from pathlib import Path
from typing import Dict, List
logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
logger = logging.getLogger(__name__)
STANDARD_SCHEMA: Dict[str, str] = {
'submittal no': 'submittal_id',
'spec section': 'spec_section',
'description': 'description',
'status': 'status',
'due date': 'due_date',
'assigned to': 'assigned_to',
'revision': 'revision'
}
def load_and_flatten_log(filepath: str) -> pd.DataFrame:
"""Reads an Excel submittal log, flattens multi-headers, and maps to standard schema."""
path = Path(filepath)
if not path.exists():
raise FileNotFoundError(f"Submittal log not found: {filepath}")
try:
# Read with multi-row headers; adjust header=[0, 1] based on actual log structure
df = pd.read_excel(filepath, header=[0, 1], engine='openpyxl')
except Exception as e:
raise RuntimeError(f"Failed to parse Excel workbook: {e}") from e
if df.empty:
raise ValueError(f"Workbook contains no parseable data: {filepath}")
# Flatten hierarchical columns
df.columns = ['_'.join(col).strip('_') for col in df.columns]
# Regex-based schema mapping
col_mapping: Dict[str, str] = {}
for raw_col in df.columns:
for key, std_key in STANDARD_SCHEMA.items():
if re.search(key, raw_col, re.IGNORECASE):
col_mapping[raw_col] = std_key
break
df = df.rename(columns=col_mapping)
# Validate critical fields exist
required = {'submittal_id', 'spec_section', 'status'}
missing = required - set(df.columns)
if missing:
logger.warning(f"Missing critical columns in {filepath}: {missing}")
return df2. Type Coercion & Status Normalization
Construction logs suffer from inconsistent date formats and status variations. Pandas type coercion must handle MM/DD/YYYY, YYYY-MM-DD, and DD-Mon-YY simultaneously, while mapping abbreviated statuses to full AIA-compliant strings. The normalization function below enforces strict type casting, standardizes CSI section formatting, and quarantines unparseable values to prevent silent data corruption.
Refer to the pandas datetime parsing documentation for advanced format parameter configurations when dealing with locale-specific date strings.
STATUS_MAP: Dict[str, str] = {
'approved': 'Approved', 'appr': 'Approved', 'a': 'Approved',
'approved as noted': 'Approved as Noted', 'an': 'Approved as Noted',
'revise and resubmit': 'Revise & Resubmit', 'r&r': 'Revise & Resubmit',
'rejected': 'Rejected', 'r': 'Rejected',
'pending': 'Pending Review', 'p': 'Pending Review'
}
def normalize_types(df: pd.DataFrame) -> pd.DataFrame:
"""Enforces strict typing, normalizes statuses, and flags malformed records."""
# Date coercion with fallback to NaT for unparseable values
if 'due_date' in df.columns:
df['due_date'] = pd.to_datetime(df['due_date'], dayfirst=False, errors='coerce')
# Status normalization
if 'status' in df.columns:
df['status'] = df['status'].astype(str).str.strip().str.lower()
df['status'] = df['status'].map(STATUS_MAP).fillna(df['status'].str.title())
# CSI Section standardization (e.g., '03 30 00' or '3.30.00' -> '03-30-00')
if 'spec_section' in df.columns:
df['spec_section'] = (
df['spec_section']
.astype(str)
.str.replace(r'[^0-9]', ' ', regex=True)
.str.strip()
.apply(lambda x: '-'.join(x.split()) if len(x.split()) == 3 else x)
)
# Quarantine rows with critical missing data
critical_null_mask = df[['submittal_id', 'status']].isna().any(axis=1)
if critical_null_mask.any():
logger.warning(f"Quarantining {critical_null_mask.sum()} rows with missing critical fields.")
df['_parse_status'] = 'quarantine'
df.loc[~critical_null_mask, '_parse_status'] = 'valid'
else:
df['_parse_status'] = 'valid'
return df3. Batch Execution & Error Quarantine
Processing individual files sequentially is inefficient for large project portfolios. Implementing a directory scanner with structured error logging enables scalable ingestion. This pattern integrates directly into Async Batching Workflows by decoupling file discovery from transformation logic.
The following pipeline aggregates valid records into a unified DataFrame while isolating parsing failures for manual review.
def process_submittal_batch(input_dir: Path, output_dir: Path) -> Dict[str, pd.DataFrame]:
"""Processes all .xlsx files in a directory, returning valid and quarantined DataFrames."""
valid_records: List[pd.DataFrame] = []
quarantined_records: List[pd.DataFrame] = []
input_dir.mkdir(parents=True, exist_ok=True)
output_dir.mkdir(parents=True, exist_ok=True)
excel_files = list(input_dir.glob('*.xlsx'))
if not excel_files:
logger.info("No Excel files found in target directory.")
return {'valid': pd.DataFrame(), 'quarantined': pd.DataFrame()}
for file_path in excel_files:
try:
logger.info(f"Processing: {file_path.name}")
df = load_and_flatten_log(str(file_path))
df = normalize_types(df)
# Split based on parse status
valid_mask = df['_parse_status'] == 'valid'
valid_records.append(df[valid_mask].drop(columns=['_parse_status']))
quarantined_records.append(df[~valid_mask])
except Exception as e:
logger.error(f"Batch failure for {file_path.name}: {e}")
# Create a minimal failure record for audit trails
failure_df = pd.DataFrame([{'source_file': file_path.name, 'error': str(e), '_parse_status': 'quarantine'}])
quarantined_records.append(failure_df)
# Concatenate results
valid_df = pd.concat(valid_records, ignore_index=True) if valid_records else pd.DataFrame()
quarantine_df = pd.concat(quarantined_records, ignore_index=True) if quarantined_records else pd.DataFrame()
# Export to Parquet for downstream consumption
if not valid_df.empty:
valid_df.to_parquet(output_dir / 'submittals_clean.parquet', index=False)
if not quarantine_df.empty:
quarantine_df.to_parquet(output_dir / 'submittals_quarantine.parquet', index=False)
logger.info(f"Batch complete. Valid: {len(valid_df)}, Quarantined: {len(quarantine_df)}")
return {'valid': valid_df, 'quarantined': quarantine_df}4. Validation & Downstream Routing
Before routing normalized logs to construction management APIs, enforce schema validation to prevent payload rejection. Use pydantic or pandera for strict row-level validation, or implement lightweight pandas assertions:
def validate_schema(df: pd.DataFrame) -> None:
"""Asserts critical constraints before API routing."""
assert df['submittal_id'].notna().all(), "Null submittal IDs detected post-normalization."
assert df['status'].isin(['Approved', 'Approved as Noted', 'Revise & Resubmit', 'Rejected', 'Pending Review']).all(), \
"Non-compliant status values detected."
assert df['due_date'].dtype == 'datetime64[ns]', "Due dates failed datetime coercion."For field extraction techniques involving scanned PDFs or image-based logs, integrate OCR preprocessing before the Pandas pipeline. Consult the OpenPyXL documentation for advanced cell styling, comment extraction, and protected sheet handling when dealing with architect-issued templates.