How to map CSI MasterFormat to custom WBS codes in Python

Construction data pipelines routinely ingest cost codes, RFIs, and submittals tagged with CSI MasterFormat divisions. Translating these standardized taxonomy entries into enterprise-specific Work Breakdown Structure (WBS) hierarchies requires deterministic parsing logic. Without a robust mapping layer, downstream budgeting, scheduling, and compliance integrations fracture under inconsistent string matching. This guide details a production-ready Python implementation for parsing CSI MasterFormat strings and resolving them to custom WBS codes, emphasizing strict schema validation, fallback routing, and audit-ready traceability. For broader architectural context, refer to Construction Data Architecture & Taxonomy guidelines when designing enterprise data lakes.

CSI MasterFormat Normalization Logic

CSI MasterFormat has used a six-digit hierarchical format since the 2004 edition, and the 2018/2020 editions retain it: DD-SS-NN, representing Division, BroadScope, and NarrowScope (officially written with spaces, as in 03 30 00). Legacy projects frequently retain the five-digit format (DD-SSS) of the pre-2004 sixteen-division system. Vendor-specific suffixes (e.g., -ALT1, -REV, -01) and inconsistent delimiter usage (spaces, hyphens, or none) further complicate direct dictionary lookups. A reliable mapping engine must normalize inputs to a canonical DD-SS-NN representation before resolution.
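
To make the input variety concrete, the sketch below classifies a raw string by shape before any normalization takes place. The helper name `classify_csi_shape` and its labels are hypothetical and exist only for this illustration. Numeric suffixes such as -01 are deliberately reported as unknown here, because their shape alone does not distinguish them from a malformed code.

```python
import re
from typing import Optional, Tuple

# Hypothetical helper for illustration; names and labels are assumptions.
_SIX = re.compile(r"\d{2}[-\s]?\d{2}[-\s]?\d{2}")   # 03-30-00, 03 30 00, 033000
_FIVE = re.compile(r"\d{5}|\d{2}[-\s]\d{3}")        # 03300, 03-300
# A suffix is a delimiter followed by a token that starts with a letter.
_SUFFIX = re.compile(r"(.*\d)[-_\s]+([A-Za-z]\w*)")


def classify_csi_shape(raw: str) -> Tuple[str, Optional[str]]:
    """Return (shape, suffix); shape is 'six_digit', 'five_digit', or 'unknown'."""
    text = raw.strip()
    suffix = None
    match = _SUFFIX.fullmatch(text)
    if match:
        text, suffix = match.group(1), match.group(2)
    if _SIX.fullmatch(text):
        return "six_digit", suffix
    if _FIVE.fullmatch(text):
        return "five_digit", suffix
    return "unknown", suffix


if __name__ == "__main__":
    for sample in ["03-30-00", "03 30 00", "033000", "03300", "03-30-00-ALT1", "123"]:
        print(f"{sample!r:<18} {classify_csi_shape(sample)}")
```

A classifier like this can be useful for profiling an export before mapping, for example to estimate how many rows still carry legacy five-digit codes.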

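The normalization routine strips non-numeric characters, validates digit count, and pads legacy formats to align with modern six-digit expectations. Padding aligns only the string format: section numbers were renumbered when the sixteen divisions were reorganized in 2004, so legacy codes still need their own entries in the mapping dictionaries. Division 00 (Procurement and Contracting Requirements) and Division 01 (General Requirements) require special handling due to their cross-disciplinary nature. Implementing this normalization upfront prevents cascading schema mismatches in downstream workflows. For deeper pattern matching techniques, consult the official Python re module documentation.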

import re
import logging
from typing import Dict, Optional, Tuple, List
from dataclasses import dataclass

# Configure structured logging for audit trails
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)-8s | %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S"
)
logger = logging.getLogger(__name__)

@dataclass(frozen=True)
class WBSMappingResult:
    original_input: str
    normalized_csi: str
    wbs_code: str
    match_type: str  # 'exact', 'division', 'regex', 'fallback'
    confidence: float

class CSIToWBSMapper:
    """Deterministic CSI MasterFormat to WBS code resolver."""

    def __init__(
        self,
        exact_map: Dict[str, str],
        division_map: Dict[str, str],
        default_wbs: str = "00-00-00-00"
    ):
        self.exact_map = {k.upper().strip(): v for k, v in exact_map.items()}
        self.division_map = {k.upper().strip(): v for k, v in division_map.items()}
        self.default_wbs = default_wbs
        # Matches exactly 5 or 6 digits, optionally separated by hyphens/spaces.
        # Kept for optional pre-validation of suffix-free codes; normalize_csi does not call it.
        self.csi_pattern = re.compile(r"^(\d{2})[-\s]?(\d{2})[-\s]?(\d{1,2})$")

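    def normalize_csi(self, raw_code: str) -> str:
        """Strips suffixes, extracts digits, and pads to canonical DD-SS-NN."""
        clean = raw_code.strip()
        # Strip alphabetic vendor suffixes (-ALT1, _REV). The suffix must start
        # with a letter; a bare \w+ would also delete "-30-00" from "03-30-00".
        clean = re.sub(r"[-_\s]+[A-Za-z]\w*", "", clean)
        # Drop a trailing numeric group after a delimited six-digit code
        # (e.g. "03-30-00-01" or "03 30 00.13").
        clean = re.sub(r"^(\d{2}[-\s]\d{2}[-\s]\d{2})[-_.\s]\d+$", r"\1", clean)
        digits_only = re.sub(r"\D", "", clean)

        if len(digits_only) not in (5, 6):
            raise ValueError(f"Invalid CSI digit count ({len(digits_only)}): {raw_code}")

        if len(digits_only) == 5:
            # Legacy five-digit DD-SSS: append '0' so the string fits DD-SS-NN.
            # This aligns the format only; it does not translate the number.
            digits_only = f"{digits_only}0"

        return f"{digits_only[:2]}-{digits_only[2:4]}-{digits_only[4:]}"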

Deterministic Resolution & Fallback Routing

Once normalized, the engine applies a cascading resolution strategy. Exact matches receive maximum confidence. Division-level matches route to parent WBS buckets when granular codes are unmapped. Division 00 and 01 codes that have no division entry of their own are routed by a prefix check to the project-wide overhead bucket (reported as match type 'regex'). Unresolvable inputs default to a designated catch-all code, and a warning is logged for manual review. This tiered approach aligns with established WBS Mapping Strategies for enterprise cost control.

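# Extends the CSIToWBSMapper class defined above. Subclassing under the same
# name keeps the snippets runnable in sequence; in a real module, define
# resolve() directly in the original class body.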
class CSIToWBSMapper(CSIToWBSMapper):
    def resolve(self, raw_code: str) -> WBSMappingResult:
        """Resolves a raw CSI string to a WBS code with confidence scoring."""
        try:
            normalized = self.normalize_csi(raw_code)
        except ValueError as e:
            logger.error("Normalization failed: %s", e)
            return WBSMappingResult(
                original_input=raw_code,
                normalized_csi="INVALID",
                wbs_code=self.default_wbs,
                match_type="fallback",
                confidence=0.0,
            )

        # 1. Exact match (highest confidence)
        if normalized in self.exact_map:
            return WBSMappingResult(raw_code, normalized, self.exact_map[normalized], "exact", 1.0)

        # 2. Division-level match (BroadScope fallback)
        division_key = f"{normalized[:2]}-00-00"
        if division_key in self.division_map:
            return WBSMappingResult(raw_code, normalized, self.division_map[division_key], "division", 0.85)

        # 3. Cross-divisional routing: Division 00/01 codes with no division entry
        #    of their own share the 00-00-00 overhead bucket (prefix check).
        if normalized.startswith(("00-", "01-")):
            fallback_div = self.division_map.get("00-00-00", self.default_wbs)
            return WBSMappingResult(raw_code, normalized, fallback_div, "regex", 0.7)

        # 4. Hard fallback with audit warning
        logger.warning("No mapping found for %s, routing to default WBS", normalized)
        return WBSMappingResult(raw_code, normalized, self.default_wbs, "fallback", 0.1)

Production Pipeline Integration

In live environments, the mapper processes batches from ERP exports, Procore/PlanGrid APIs, or CSV cost reports. The integration layer must handle malformed inputs gracefully, preserve original metadata, and emit structured logs for compliance auditing. Python’s built-in logging module provides robust configuration for rotating audit files, as detailed in the Python Logging HOWTO.
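
One way to set up the rotating audit file mentioned above is the standard library's `RotatingFileHandler`. The sketch below is a minimal configuration, not a recommended policy: the logger name `csi_wbs_audit`, the function name `build_audit_logger`, and the size and backup defaults are all assumptions chosen for illustration.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_audit_logger(
    log_path: str,
    max_bytes: int = 1_000_000,  # assumed rotation size, tune per retention policy
    backups: int = 5,            # assumed number of rotated files to keep
) -> logging.Logger:
    """Return a logger that writes pipe-delimited audit lines to a rotating file."""
    audit = logging.getLogger("csi_wbs_audit")  # hypothetical logger name
    audit.setLevel(logging.INFO)
    audit.propagate = False  # keep audit lines out of the root logger's handlers
    if not audit.handlers:   # avoid stacking duplicate handlers on repeated calls
        handler = RotatingFileHandler(
            log_path, maxBytes=max_bytes, backupCount=backups, encoding="utf-8"
        )
        handler.setFormatter(logging.Formatter(
            "%(asctime)s | %(levelname)-8s | %(message)s",
            datefmt="%Y-%m-%d %H:%M:%S",
        ))
        audit.addHandler(handler)
    return audit
```

Keeping audit output on a dedicated logger, separate from the module logger used for diagnostics, makes it easier to apply a different retention period to compliance records.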

def process_cost_codes_batch(
    mapper: CSIToWBSMapper,
    raw_codes: List[str]
) -> List[WBSMappingResult]:
    """Batch processes raw CSI strings with error isolation and audit logging."""
    results = []
    for code in raw_codes:
        try:
            result = mapper.resolve(code.strip())
            results.append(result)
            if result.match_type == "fallback":
                logger.info("AUDIT | Fallback routed | Input: %s | WBS: %s", code, result.wbs_code)
        except Exception as e:
            logger.critical("UNHANDLED EXCEPTION | Input: %s | Error: %s", code, str(e))
            results.append(WBSMappingResult(
                original_input=code,
                normalized_csi="ERROR",
                wbs_code=mapper.default_wbs,
                match_type="fallback",
                confidence=0.0
            ))
    return results

# Example usage
if __name__ == "__main__":
    # Enterprise mapping dictionaries (typically loaded from DB/JSON)
    EXACT_MAPPINGS = {
        "03-30-00": "WBS-CON-0330",
        "09-21-16": "WBS-FIN-0921",
        "26-05-19": "WBS-ELC-2605"
    }
    DIVISION_MAPPINGS = {
        "03-00-00": "WBS-CON-0300",
        "09-00-00": "WBS-FIN-0900",
        "00-00-00": "WBS-ADM-0000"
    }

    mapper = CSIToWBSMapper(
        exact_map=EXACT_MAPPINGS,
        division_map=DIVISION_MAPPINGS,
        default_wbs="WBS-ADM-UNMAPPED"
    )

    test_inputs = [
        "03-30-00",      # Exact match
        "09-21-16",      # Exact match
        "03-30-00-ALT1", # Suffix handling
        "03-30-12",      # Division fallback
        "01-00-00",      # Cross-divisional regex
        "123",           # Invalid length
        "26-05-19"       # Exact match
    ]

    resolved = process_cost_codes_batch(mapper, test_inputs)
    for r in resolved:
        print(f"{r.original_input:<15} -> {r.wbs_code:<18} | Type: {r.match_type:<10} | Conf: {r.confidence}")
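
After a batch run, a short summary of match types helps reviewers see how much of the input fell through to lower tiers. The following sketch is one possible way to do that. The function name `summarize_results` and the 0.5 review threshold are assumptions, and it accepts any objects that expose `match_type`, `confidence`, and `original_input`, such as the WBSMappingResult instances above.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List, Tuple


def summarize_results(
    results: Iterable[Any],
    review_threshold: float = 0.5,  # assumed cutoff for manual review
) -> Tuple[Dict[str, int], List[str]]:
    """Count results per match_type and collect inputs below the threshold."""
    counts: Counter = Counter()
    needs_review: List[str] = []
    for result in results:
        counts[result.match_type] += 1
        if result.confidence < review_threshold:
            needs_review.append(result.original_input)
    return dict(counts), needs_review
```

Calling `summarize_results(resolved)` on the batch output returns a per-tier count and the list of original inputs that a reviewer should inspect.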

Validation & Edge Case Handling

Production deployments require strict input validation and deterministic fallback behavior. The implementation above enforces digit-length constraints, strips vendor suffixes, and prevents silent failures by routing unmapped codes to auditable defaults. When integrating with scheduling engines like Primavera P6 or MS Project, ensure WBS codes adhere to the hierarchical length limits configured in the target system (often 10-16 characters; verify the actual limit for your installation). Always validate mapping dictionaries against the latest CSI MasterFormat release published by the Construction Specifications Institute before deploying to production.
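
A structural check of the mapping dictionaries can run before deployment. The sketch below covers format only: keys in canonical DD-SS-NN form, division keys ending in -00-00, and WBS codes within a length limit. It does not check keys against the MasterFormat publication itself, which requires the licensed number list. The function name `validate_mapping` and the default limit of 16 characters are assumptions for illustration.

```python
import re
from typing import Dict, List

CANONICAL_KEY = re.compile(r"\d{2}-\d{2}-\d{2}")


def validate_mapping(
    mapping: Dict[str, str],
    max_wbs_len: int = 16,       # assumed limit, confirm against the target system
    division_only: bool = False,  # True when checking the division-level map
) -> List[str]:
    """Return human-readable problems; an empty list means the mapping passed."""
    problems: List[str] = []
    for key, wbs in mapping.items():
        if not CANONICAL_KEY.fullmatch(key):
            problems.append(f"{key!r}: key is not in DD-SS-NN form")
        elif division_only and not key.endswith("-00-00"):
            problems.append(f"{key!r}: division keys must end in -00-00")
        if not wbs or len(wbs) > max_wbs_len:
            problems.append(
                f"{key!r}: WBS code {wbs!r} is empty or longer than {max_wbs_len} characters"
            )
    return problems
```

Running this against both dictionaries at load time, and refusing to start the pipeline when the returned list is non-empty, keeps malformed keys from silently turning exact matches into fallbacks.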