Validating extracted RFI fields against custom JSON schemas
Automated document ingestion pipelines frequently extract Request for Information (RFI) data from PDFs, scanned submittals, and field-captured forms. The resulting payloads, however, rarely enter downstream systems in a pristine state. Raw OCR output routinely introduces type mismatches, truncated strings, malformed dates, and hallucinated fields. Without deterministic schema enforcement, these artifacts cascade into scheduling conflicts, cost tracking discrepancies, and compliance violations across the project lifecycle. A robust validation layer must normalize extraction noise, enforce construction-specific business rules, and route failures to targeted remediation queues before data enters the Automated Document Ingestion & Parsing ecosystem.
Schema Architecture for Construction Workflows
Generic type checking is insufficient for construction documentation. RFI validation requires a custom JSON schema that reflects industry conventions such as the AIA G716 Request for Information form, CSI MasterFormat division codes, and ISO 19650 metadata conventions. The schema must encode domain-specific constraints rather than relying on primitive JSON types alone. Key architectural considerations include:
- Identifier Conventions: RFI numbers must follow project-specific alphanumeric patterns (e.g., RFI-YYYY-NNNNN) to prevent duplicate routing.
- Trade & Scope Mapping: Classification codes must align with CSI MasterFormat divisions to ensure accurate subcontractor assignment and cost code routing.
- Impact Quantification: Cost and schedule impact fields require bounded numeric validation with explicit currency normalization.
- Calendar Alignment: Date fields must conform to ISO 8601 formats while rejecting impossible calendar values (e.g., February 30th).
- State Machine Enforcement: Status fields must restrict values to a predefined enum matching the project’s RFI lifecycle.
These constraints are formalized within the Schema Validation Rules framework to guarantee deterministic parsing behavior across heterogeneous document sources.
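The impact-quantification constraint in the list above implies a normalization step before any bounds check. The following is a minimal sketch of that step, not a definitive implementation. It assumes US-style formatting (a `$` prefix and `,` thousands separators); the helper name `normalize_cost_impact` and the bounds are hypothetical and would need to match the project's own contract values.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional, Union

# Illustrative bounds only (assumption); real limits depend on the project.
MIN_IMPACT = Decimal("-1000000")
MAX_IMPACT = Decimal("10000000")


def normalize_cost_impact(raw: Union[str, int, float, None]) -> Optional[Decimal]:
    """Return the cost impact as a Decimal, or None when the field is empty.

    Hypothetical helper: assumes '$' prefixes and ',' thousands separators.
    Raises ValueError when the value cannot be parsed or is out of bounds.
    """
    if raw is None or (isinstance(raw, str) and raw.strip() == ""):
        return None
    if isinstance(raw, bool):
        # bool is a subclass of int in Python; never treat it as money.
        raise ValueError("boolean is not a valid cost impact")
    cleaned = str(raw).strip().replace("$", "").replace(",", "")
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        raise ValueError(f"unparseable cost impact: {raw!r}") from None
    # Check finiteness first so NaN never reaches the range comparison.
    if not value.is_finite() or not (MIN_IMPACT <= value <= MAX_IMPACT):
        raise ValueError(f"cost impact out of bounds: {raw!r}")
    return value
```

Using `Decimal` rather than `float` avoids binary rounding in monetary values, which matters once the figure reaches cost tracking.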
Production-Ready Python Implementation
The jsonschema library, paired with custom format validators, provides a production-grade mechanism for enforcing construction-aware constraints. The following implementation defines a complete validation engine, registers domain-specific format checkers, and aggregates all validation errors for structured remediation.
import json
import re
from datetime import datetime
from typing import Any, Dict, List, Tuple

from jsonschema import Draft7Validator, FormatChecker, ValidationError


# Construction-specific format validators.
# jsonschema invokes format checkers for instances of every JSON type, so the
# string-only checkers return True for non-strings and leave that failure to
# the schema's "type" keyword instead of raising TypeError.
def validate_rfi_number(instance: Any) -> bool:
    """Enforces project-standard RFI numbering: RFI-YYYY-NNNNN"""
    if not isinstance(instance, str):
        return True
    return bool(re.fullmatch(r"RFI-\d{4}-\d{5}", instance))


def validate_iso_date(instance: Any) -> bool:
    """Validates YYYY-MM-DD format and rejects invalid calendar dates."""
    if not isinstance(instance, str):
        return True
    # strptime accepts unpadded values such as "2024-5-1", so check the shape first.
    if not re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", instance):
        return False
    try:
        datetime.strptime(instance, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def validate_cost_impact(instance: Any) -> bool:
    """Normalizes currency strings and enforces realistic construction bounds."""
    if instance is None or instance == "":
        return True  # Optional field
    try:
        val = float(str(instance).replace(",", "").replace("$", ""))
        return -1_000_000 <= val <= 10_000_000
    except ValueError:
        return False


def validate_csi_division(instance: Any) -> bool:
    """Matches 2-digit CSI MasterFormat division codes (00-49)."""
    if not isinstance(instance, str):
        return True
    return bool(re.fullmatch(r"[0-4][0-9]", instance))


# Register custom formats
format_checker = FormatChecker()
format_checker.checks("rfi_number")(validate_rfi_number)
format_checker.checks("iso_date")(validate_iso_date)
format_checker.checks("cost_impact")(validate_cost_impact)
format_checker.checks("csi_division")(validate_csi_division)

# Custom RFI Schema Definition
RFI_SCHEMA = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "required": ["rfi_number", "issue_date", "trade_code", "description", "status"],
    "properties": {
        "rfi_number": {"type": "string", "format": "rfi_number"},
        "issue_date": {"type": "string", "format": "iso_date"},
        "due_date": {"type": "string", "format": "iso_date"},
        "trade_code": {"type": "string", "format": "csi_division"},
        "description": {"type": "string", "minLength": 10, "maxLength": 2000},
        "status": {"type": "string", "enum": ["Open", "Pending", "Answered", "Closed", "Withdrawn"]},
        "cost_impact": {"type": ["string", "number", "null"], "format": "cost_impact"},
        "schedule_impact_days": {"type": ["integer", "null"], "minimum": 0, "maximum": 365},
        "originator": {"type": "string", "pattern": "^[A-Za-z0-9\\s\\-]+$"},
        "attachments": {
            "type": "array",
            "items": {"type": "string", "format": "uri"}
        }
    },
    "additionalProperties": False
}

def validate_rfi_payload(payload: Dict[str, Any]) -> Tuple[bool, List[Dict[str, Any]]]:
    """
    Validates an extracted RFI payload against the custom construction schema.
    Returns a tuple of (is_valid, list_of_structured_errors).
    """
    if not isinstance(payload, dict):
        return False, [{
            "field": "$",
            "message": "Payload must be a JSON object/dictionary.",
            "validator": "type_check",
            "value": type(payload).__name__
        }]

    # iter_errors yields every validation failure, not just the first encountered,
    # so one pass both decides validity and collects the structured errors.
    validator = Draft7Validator(RFI_SCHEMA, format_checker=format_checker)
    errors: List[Dict[str, Any]] = []
    for error in validator.iter_errors(payload):
        field_path = ".".join(map(str, error.absolute_path)) or "$"
        errors.append({
            "field": field_path,
            "message": error.message,
            "validator": error.validator,
            "value": error.instance
        })
    return not errors, errors

# Execution block for local testing or CI integration
if __name__ == "__main__":
    sample_payload = {
        "rfi_number": "RFI-2024-00123",
        "issue_date": "2024-05-15",
        "due_date": "2024-05-22",
        "trade_code": "03",
        "description": "Clarification required for concrete pour sequence in Zone B.",
        "status": "Open",
        "cost_impact": "$12,500",
        "schedule_impact_days": 2,
        "originator": "Site Foreman Alpha"
    }
    is_valid, validation_errors = validate_rfi_payload(sample_payload)
    print(f"Validation Result: {'PASS' if is_valid else 'FAIL'}")
    if not is_valid:
        print(json.dumps(validation_errors, indent=2))

Pipeline Integration and Error Routing
Validating extracted fields is only effective when integrated into an asynchronous processing workflow. The validation function above returns a structured error array containing the exact JSON path, violated constraint, and raw extracted value. This output enables deterministic routing:
- Hard Failures: Missing required fields (rfi_number, status) or malformed identifiers trigger immediate rejection. These payloads are routed to a manual review queue where project coordinators can attach corrected metadata.
- Soft Failures: Optional fields with out-of-bounds values (e.g., schedule_impact_days: 400) or unrecognized trade codes trigger warning flags. The pipeline can auto-correct known patterns (e.g., stripping currency symbols) or escalate to estimators for verification.
- Strict Contracts: Because additionalProperties is set to false, the schema rejects any field it does not declare, which keeps hallucinated OCR artifacts out of downstream systems. This follows the JSON Schema specification's mechanism for strict data contracts.
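The routing rules above can be sketched as a small classifier over the structured error list. This is an illustration under stated assumptions rather than a prescribed policy: the queue names, the `HARD_FIELDS` and `HARD_VALIDATORS` sets, and the `route_errors` function are hypothetical, and each error dictionary is assumed to carry the `field` and `validator` keys produced by `validate_rfi_payload`. Treating undeclared properties as a hard failure is also an assumption.

```python
from typing import Any, Dict, List

# Hypothetical policy: malformed identifiers and structural violations block
# ingestion; every other violation is treated as a warning.
HARD_FIELDS = {"rfi_number"}
HARD_VALIDATORS = {"required", "additionalProperties", "type_check"}


def route_errors(errors: List[Dict[str, Any]]) -> str:
    """Map structured validation errors to a queue name (names are illustrative)."""
    if not errors:
        return "publish"
    for err in errors:
        # "attachments.0" and similar nested paths are keyed by their top-level field.
        top_level_field = str(err.get("field", "$")).split(".")[0]
        if err.get("validator") in HARD_VALIDATORS or top_level_field in HARD_FIELDS:
            return "manual_review"
    return "estimator_verification"
```

A single hard failure sends the whole payload to manual review even when soft failures are also present, so coordinators see the record before any auto-correction is attempted.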
For high-throughput environments, wrap the validation logic in an async worker that batches payloads, logs validation metrics to a centralized telemetry service, and publishes clean records to the project management database. Refer to the official python-jsonschema documentation for advanced draft-07 configuration and custom keyword registration.
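A minimal sketch of such a worker, using only `asyncio` from the standard library (Python 3.9+ for `asyncio.to_thread`). The `validate` argument stands in for `validate_rfi_payload`; the `publish` coroutine, the `None` sentinel, the batch size, and the in-memory `Counter` are hypothetical placeholders for a real database client and telemetry service.

```python
import asyncio
from collections import Counter
from typing import Any, Awaitable, Callable, Dict, List, Tuple

Payload = Dict[str, Any]
ErrorList = List[Dict[str, Any]]
ValidateFn = Callable[[Payload], Tuple[bool, ErrorList]]
PublishFn = Callable[[List[Payload]], Awaitable[None]]


async def validation_worker(
    queue: asyncio.Queue,
    validate: ValidateFn,
    publish: PublishFn,
    metrics: Counter,
    batch_size: int = 50,
) -> List[Tuple[Payload, ErrorList]]:
    """Drain the queue until a None sentinel arrives.

    Valid payloads are published in batches; rejected payloads are returned
    together with their errors so the caller can route them.
    """
    batch: List[Payload] = []
    rejected: List[Tuple[Payload, ErrorList]] = []
    while True:
        payload = await queue.get()
        if payload is None:  # sentinel (assumption): no more work
            break
        # The validator is synchronous, so run it off the event loop thread.
        is_valid, errors = await asyncio.to_thread(validate, payload)
        metrics["valid" if is_valid else "invalid"] += 1
        if is_valid:
            batch.append(payload)
        else:
            rejected.append((payload, errors))
        if len(batch) >= batch_size:
            await publish(batch)
            batch = []
    if batch:  # flush the final partial batch
        await publish(batch)
    return rejected
```

Passing the validator and publisher in as callables keeps the worker testable with stubs and leaves the choice of database and telemetry backend to the deployment.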
Compliance and Standard Alignment
Construction automation must respect contractual and regulatory boundaries. The schema's status enum is intended to track the RFI response workflow documented on forms such as AIA G716, while trade_code validation ensures alignment with CSI MasterFormat divisions. Estimators and project managers rely on these constraints to maintain accurate change order logs and schedule baselines. By enforcing strict type boundaries, date normalization, and impact quantification, the validation layer catches silent data corruption before it reaches financial or scheduling modules.
When deployed alongside robust OCR preprocessing and field extraction techniques, this validation pipeline helps ensure that only structurally sound, construction-aware RFI data propagates through the project lifecycle.