OCR Preprocessing for Construction Docs

Construction document ingestion operates under strict financial and scheduling tolerances. A single misread change-order line item or a misaligned revision stamp can cascade into budget overruns, disputed quantities, and critical-path delays. The specific problem this page solves is the raster boundary: turning folded, low-contrast, hand-annotated scans — the AIA submittals, marked-up drawings, and faxed change orders that field teams actually send — into clean, deskewed, confidence-scored images that an extraction engine can trust. Generic optical character recognition cannot do this on raw scans; it inherits every fold line, halftone shade, and skew angle as noise and silently corrupts cost figures. Preprocessing is therefore the foundational layer of any Automated Document Ingestion & Parsing pipeline: it runs before field extraction techniques pull structured values and well before change order schema validation commits anything to the ledger. This guide covers the deterministic Python pipeline that normalizes those documents, scores each page, and routes it using the confidence bands shared across the whole system.

Prerequisites

Before building the preprocessing layer, confirm the following packages, infrastructure, and upstream assumptions are in place. The core is pure CPU image processing; the OCR engine and downstream contracts are shared with the rest of the ingestion pipeline.

Python 3.11+, for Literal and the built-in generic annotations (list[...], dict[...]) used in the page contract.
opencv-python>=4.9 — adaptive thresholding, Hough-based deskew, and morphological grid suppression.
PyMuPDF>=1.24 (fitz) — high-fidelity PDF rasterization with explicit DPI control; pdfplumber is the alternative when native text layers are present.
numpy>=1.26 — array math for the projection-profile and Laplacian confidence proxies.
pydantic>=2.5 — the per-page output contract uses Pydantic v2 (field_validator with @classmethod, model_dump_json).
pytesseract>=0.3 + Tesseract 5 — only on the worker tier; preprocessing produces its input, never the other way around.
Upstream assumption: documents have already been classified by type and de-duplicated by content hash at the ingestion gateway, and multi-format reconciliation is handled by the PDF/Excel Sync Pipelines framework before a page reaches this stage. Preprocessing assumes it is looking at one known document; it does not detect document type.

Architecture Detail

The subsystem is a single-page-at-a-time funnel: a rasterized page in, a cleaned grayscale TIFF plus a scored metadata record out, with explicit branches for pages too degraded to trust. Native text-bearing PDFs skip rasterization entirely; only pages detected as image-only enter the deskew-and-binarize path. Every page exits with an extraction_confidence that drives the same routing bands used downstream — a stamped, clean drawing clears 0.92 and flows straight to extraction, a faded pencil markup lands in human review, and an unreadable fax quarantines rather than injecting garbage into the cost ledger.

The routing decision is governed by site-canonical confidence thresholds that recur throughout this pipeline: a page-quality confidence of 0.92 or higher auto-routes, a score of at least 0.75 but below 0.92 diverts to human review, and anything below 0.75 quarantines the page so a document-control specialist can re-capture or manually transcribe it before any data reaches extraction.

Step-by-Step Implementation

The pipeline follows a strict, idempotent lifecycle. Each step carries a construction-specific rationale, because the failure cost here is a corrupted cost figure, not a dropped log line.

Classify and short-circuit native text. If a page already carries a reliable embedded text layer, skip rasterization — re-OCRing a digitally generated change order only adds error. Fall back to controlled rasterization only for image-only pages.
Rasterize deterministically. Render image-only pages to 300 DPI grayscale, which balances OCR accuracy against the storage overhead of multi-gigabyte submittal packages. Pin the DPI so a retry produces byte-identical input and the step stays idempotent.
Deskew before binarizing. Field photography and flatbed scans arrive rotated. Detect the dominant horizontal axis of title blocks and revision schedules with a Hough line transform and rotate to correct it, because a 2-degree skew is enough to break row-to-column alignment in a tabular cost breakdown.
Binarize adaptively, not globally. Apply local adaptive thresholding (Sauvola, Niblack, or Gaussian) rather than a single global cutoff, so faint pencil change-order notes survive against the dense background grids of architectural prints.
Suppress grid lines. Use morphological operations to attenuate drawing grids and form rules that OCR would otherwise read as stray characters, while preserving the cell boundaries that anchor tabular parsing.
Score per region, then route. Compute a confidence proxy for the cleaned page and map it onto the canonical bands. Low-confidence zones, such as overlapping stamps or blurred handwriting, are flagged for the localized treatment described in Handling OCR Drift in Scanned Construction Blueprints rather than passed through silently.

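Idempotency is the load-bearing decision: because retries are inevitable on flaky site connectivity and broker redelivery, every output is keyed by source hash and page index and cached, so replaying a document never re-runs expensive OCR on a page it already cleaned. The module below is runnable and demonstrates the core of that lifecycle (native-text short-circuit, rasterization, deskew, binarization, scoring, and routing) with strict typing, a Pydantic v2 contract, and structured logging for audit trails. It names each output file from the source hash and page index; grid suppression and the cache lookup on that key are not included.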

from __future__ import annotations

import logging
from enum import Enum
from pathlib import Path
from typing import Any, Literal

import cv2
import fitz  # PyMuPDF
import numpy as np
from pydantic import BaseModel, Field, ValidationError, field_validator

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)-8s | %(name)s | %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%S",
)
logger = logging.getLogger("ocr_preprocessing")

# Canonical confidence bands shared across every routing decision on the site.
AUTO_ROUTE_THRESHOLD = 0.92
QUARANTINE_THRESHOLD = 0.75

Discipline = Literal["ARCH", "STR", "MEP", "CIV", "ELEC", "PLMB"]


class RoutingState(str, Enum):
    """Where a preprocessed page goes, derived from its quality confidence."""

    AUTO_ROUTE = "auto_route"
    HUMAN_REVIEW = "human_review"
    QUARANTINE = "quarantine"


def route_for_confidence(score: float) -> RoutingState:
    """Map a page-quality confidence onto the canonical routing bands."""
    if score >= AUTO_ROUTE_THRESHOLD:
        return RoutingState.AUTO_ROUTE
    if score >= QUARANTINE_THRESHOLD:
        return RoutingState.HUMAN_REVIEW
    return RoutingState.QUARANTINE


class PreprocessedPage(BaseModel):
    """Strict contract emitted for every preprocessed construction page."""

    page_index: int = Field(..., ge=0)
    source_hash: str = Field(..., min_length=8, description="SHA-256 of source document")
    output_path: str
    dpi: int = Field(..., ge=150, le=600)
    deskew_angle: float = Field(..., ge=-15.0, le=15.0)
    extraction_confidence: float = Field(..., ge=0.0, le=1.0)
    discipline: Discipline = "ARCH"

    @field_validator("deskew_angle")
    @classmethod
    def clamp_extreme_skew(cls, v: float) -> float:
        # A skew beyond +/-15 deg means deskew detection latched onto the
        # wrong axis (often a vertical grid line); reject rather than rotate.
        if abs(v) > 15.0:
            raise ValueError("Skew exceeds +/-15 deg; likely a detection error")
        return v

    @property
    def routing_state(self) -> RoutingState:
        return route_for_confidence(self.extraction_confidence)


def compute_skew_angle(gray: np.ndarray) -> float:
    """Hough-based skew detection tuned for construction title blocks."""
    edges = cv2.Canny(gray, 50, 150, apertureSize=3)
    lines = cv2.HoughLinesP(
        edges, 1, np.pi / 180, threshold=100, minLineLength=100, maxLineGap=10
    )
    if lines is None:
        return 0.0
    angles: list[float] = []
    for line in lines:
        x1, y1, x2, y2 = line[0]
        angle = np.degrees(np.arctan2(y2 - y1, x2 - x1))
        # Keep near-horizontal lines; drop near-vertical grid rules that
        # would otherwise drag the median toward a false 90-degree rotation.
        if -15.0 < angle < 15.0:
            angles.append(angle)
    return float(np.median(angles)) if angles else 0.0


def preprocess_page(
    img_bgr: np.ndarray, page_index: int, source_hash: str, output_dir: str, dpi: int = 300
) -> PreprocessedPage:
    """Deskew, binarize, score, and persist a single rasterized page."""
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)

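    # Detect skew on a binarized probe, where title-block rules are crisp.
    probe = cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 25, 10
    )
    angle = compute_skew_angle(probe)

    # Deskew the grayscale page first, then binarize the aligned result so the
    # adaptive window sees straight rows (steps 3 and 4 of the lifecycle).
    h, w = gray.shape[:2]
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    deskewed = cv2.warpAffine(
        gray, matrix, (w, h), flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE
    )

    # Adaptive thresholding preserves faint pencil annotations against grids.
    rotated = cv2.adaptiveThreshold(
        deskewed, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 25, 10
    )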

    # Confidence proxy: Laplacian variance (sharpness) normalized to [0, 1].
    # A degraded fax scores low and is held back from the auto-route path.
    variance = cv2.Laplacian(rotated, cv2.CV_64F).var()
    confidence = float(min(1.0, variance / 800.0))

    out_file = Path(output_dir) / f"{source_hash[:12]}_page_{page_index:03d}.tiff"
    cv2.imwrite(
        str(out_file),
        rotated,
        [cv2.IMWRITE_TIFF_COMPRESSION, cv2.IMWRITE_TIFF_COMPRESSION_NONE],
    )

    return PreprocessedPage(
        page_index=page_index,
        source_hash=source_hash,
        output_path=str(out_file),
        dpi=dpi,
        deskew_angle=round(angle, 3),
        extraction_confidence=round(confidence, 4),
    )


def preprocess_document(
    pdf_path: str, source_hash: str, output_dir: str, target_dpi: int = 300
) -> list[dict[str, Any]]:
    """
    Rasterize image-only pages, preprocess each, and emit validated records.

    Native text-bearing pages are skipped here and handed straight to the
    extraction layer; only rasterized pages need deskew and binarization.
    """
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"Input document missing: {pdf_path}")

    Path(output_dir).mkdir(parents=True, exist_ok=True)
    results: list[dict[str, Any]] = []

    try:
        with fitz.open(str(path)) as doc:
            for idx, page in enumerate(doc):
                # Short-circuit pages that already carry reliable text.
                if page.get_text("text").strip():
                    logger.info("Page %d has native text; skipping rasterization", idx)
                    continue

                pix = page.get_pixmap(dpi=target_dpi)
                img = np.frombuffer(pix.samples, dtype=np.uint8).reshape(
                    pix.height, pix.width, pix.n
                )
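                if pix.n == 4:  # RGBA: drop alpha and reorder for OpenCV
                    img = cv2.cvtColor(img, cv2.COLOR_RGBA2BGR)
                elif pix.n == 3:  # PyMuPDF yields RGB; OpenCV expects BGR
                    img = cv2.cvtColor(img, cv2.COLOR_RGB2BGR)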

                try:
                    record = preprocess_page(img, idx, source_hash, output_dir, target_dpi)
                except ValidationError as ve:
                    logger.warning("Page %d failed the contract: %s", idx, ve.errors())
                    continue

                logger.info(
                    "Page %d | skew %.2f deg | conf %.3f -> %s",
                    record.page_index, record.deskew_angle,
                    record.extraction_confidence, record.routing_state.value,
                )
                results.append(record.model_dump())

        logger.info("Preprocessed %d image-only pages", len(results))
        return results

    except Exception as exc:  # noqa: BLE001 - record, then escalate to the DLQ
        logger.exception("Critical preprocessing failure for %s", path.name)
        raise RuntimeError(f"OCR preprocessing aborted: {exc}") from exc

Two choices carry the weight here. First, the page contract is enforced in the type system: an impossible skew angle or an out-of-range confidence raises at construction time, so a malformed page record can never be written, let alone routed. Second, the routing decision is a pure function of the confidence score, so the same 0.92 and 0.75 thresholds govern preprocessing exactly as they govern the async queue architecture and field extraction downstream — routing is never re-litigated per worker.

Schema and Configuration Reference

The table below defines the fields and tuning keys that govern this subsystem. The extraction_confidence and deskew_angle are the two values downstream stages depend on most.

Field / key	Type	Rule	Construction rationale
`page_index`	`int`	`>= 0`	Stable position for idempotent retries
`source_hash`	`str`	length `>= 8`	Content key; prevents re-OCR of cleaned pages
`output_path`	`str`	required	Location of the cleaned TIFF handed to extraction
`dpi`	`int`	`150`–`600`	300 balances OCR accuracy vs. package size
`deskew_angle`	`float`	`-15.0`–`15.0`	Beyond this, detection latched on the wrong axis
`extraction_confidence`	`float`	`0.0`–`1.0`	Drives the routing decision
`discipline`	`Literal`	ARCH/STR/MEP/CIV/ELEC/PLMB	Closed enum; no free-text drift
`AUTO_ROUTE_THRESHOLD`	config	`0.92`	Auto-route at or above
`QUARANTINE_THRESHOLD`	config	`0.75`	Below this, quarantine
`block_size`	config	odd, `15`–`31`	Adaptive-threshold window; larger for dense grids
`C` (threshold offset)	config	`5`–`15`	Bias that preserves faint pencil marks
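The two tuning keys at the bottom of the table are hardcoded in the module (25 and 10). If they were lifted into configuration, the same Pydantic v2 style could enforce the table's ranges. The sketch below assumes a hypothetical ThresholdConfig model and a c_offset field name; neither exists in the module.

```python
from pydantic import BaseModel, Field, field_validator


class ThresholdConfig(BaseModel):
    """Hypothetical tuning record for the adaptive-threshold step."""

    block_size: int = Field(25, ge=15, le=31)
    c_offset: int = Field(10, ge=5, le=15)

    @field_validator("block_size")
    @classmethod
    def must_be_odd(cls, v: int) -> int:
        # cv2.adaptiveThreshold needs an odd window so it has a centre pixel.
        if v % 2 == 0:
            raise ValueError("block_size must be odd")
        return v
```

A worker could then pass cfg.block_size and cfg.c_offset to cv2.adaptiveThreshold in place of the literals, and a bad value in a deployment file would fail at load time, before any page is processed.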

Verification and Testing

Confirm correct behavior with assertions that exercise the confidence router, the skew clamp, and the contract. Synthetic fixtures keep the test deterministic without depending on real scanner hardware.

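import numpy as np
from pydantic import ValidationError

from ocr_preprocessing import (
    PreprocessedPage,
    RoutingState,
    compute_skew_angle,
    route_for_confidence,
)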


def test_confidence_router() -> None:
    assert route_for_confidence(0.95) is RoutingState.AUTO_ROUTE
    assert route_for_confidence(0.80) is RoutingState.HUMAN_REVIEW
    assert route_for_confidence(0.50) is RoutingState.QUARANTINE


def test_contract_rejects_extreme_skew() -> None:
    try:
        PreprocessedPage(
            page_index=0, source_hash="abcd1234ef", output_path="/tmp/p.tiff",
            dpi=300, deskew_angle=27.0, extraction_confidence=0.9,
        )
        raise AssertionError("Expected ValidationError on extreme skew")
    except ValidationError:
        pass


def test_skew_detection_on_synthetic_grid() -> None:
    # A single perfectly horizontal line on a black canvas should yield a
    # near-zero skew estimate.
    canvas = np.zeros((200, 400), dtype=np.uint8)
    canvas[100, 50:350] = 255
    assert abs(compute_skew_angle(canvas)) < 1.0


def test_routing_state_property() -> None:
    page = PreprocessedPage(
        page_index=2, source_hash="deadbeef99", output_path="/tmp/p.tiff",
        dpi=300, deskew_angle=1.2, extraction_confidence=0.6,
    )
    assert page.routing_state is RoutingState.QUARANTINE

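Run preprocess_document against a sample package and watch the structured log: a clean drawing should report a small skew and auto_route, while a degraded scan should land in human_review or quarantine. The module as listed has no command-line entry point, so running python ocr_preprocessing.py directly only defines the functions; call preprocess_document from a REPL or a short driver script instead. Use PreprocessedPage(...).model_dump_json() to snapshot a record for golden-file comparison in CI.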

Troubleshooting

Every page quarantines on a clean batch. Root cause: the Laplacian confidence proxy is mis-scaled for your scanner — its sharpness variance never approaches 800. Fix: calibrate the normalization divisor against a known-good page rather than treating 800.0 as universal, then re-baseline the thresholds.
Deskew rotates pages 90 degrees. Root cause: the Hough step latched onto vertical title-block rules instead of horizontal text baselines. Fix: the -15.0 < angle < 15.0 filter and the clamp_extreme_skew validator together reject that case; confirm both are active and that the page was binarized before skew detection.
Faint pencil change-order notes vanish after binarization. Root cause: global thresholding, or an adaptive block_size/C tuned for clean text. Fix: lower C and widen the adaptive window so local contrast preserves light strokes against dense grids; for chronic cases, route the page to the treatment described in Handling OCR Drift in Scanned Construction Blueprints.
Native digital PDFs come out worse after preprocessing. Root cause: the rasterize path ran on a page that already had a text layer. Fix: the page.get_text("text") short-circuit skips text-bearing pages; verify classification is not forcing every page down the OCR branch.
Memory spikes on large submittal packages. Root cause: rendering an entire multi-gigabyte PDF before processing. Fix: stream page by page as shown, keep dpi at 300 unless accuracy demands more, and offload work to the async queue architecture so a single oversized package cannot monopolize a worker.
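The first fix above, calibrating the normalization divisor, can be sketched as follows. It assumes you have already measured the Laplacian variance of a handful of pages that a reviewer accepted as clean. The function name calibrate_divisor and the 0.95 target are hypothetical choices; the target only needs to sit above the 0.92 auto-route band.

```python
import statistics


def calibrate_divisor(known_good_variances: list[float], target: float = 0.95) -> float:
    """Pick a divisor so the median known-good page scores `target`."""
    if not known_good_variances:
        raise ValueError("need at least one known-good page")
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    # confidence = variance / divisor, so divisor = variance / confidence.
    return statistics.median(known_good_variances) / target
```

Using the median rather than the mean keeps one unusually sharp or soft reference page from skewing the scale. Pages noticeably softer than the reference set then fall below 0.92 and divert to review.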

Frequently Asked Questions

Why deskew before adaptive thresholding instead of after?

Adaptive thresholding evaluates local neighborhoods, and a skewed page smears text baselines across the window, lowering local contrast and erasing faint marks. Correcting rotation first keeps each row aligned so the threshold window sees clean stroke-versus-background contrast. Deskew on the binarized image is fine for angle detection, but the final rotation should produce the image extraction consumes.

What DPI should I rasterize construction drawings at?

300 DPI grayscale is the default that balances OCR accuracy against the storage and memory cost of large packages. Drop to 200 only for text-dense, clean documents where speed matters; raise toward 400–600 only for dense engineering drawings with fine annotations, and expect memory per page to grow with the square of the DPI, so doubling the resolution roughly quadruples the raster size.
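To put numbers on that trade-off, the uncompressed size of one 8-bit raster is width in pixels times height in pixels times channels. The helper below is an illustrative sketch (raster_bytes is not part of the module), applied to a 24 x 36 inch ARCH D sheet.

```python
def raster_bytes(width_in: float, height_in: float, dpi: int, channels: int = 1) -> int:
    """Uncompressed size in bytes of one 8-bit page raster."""
    return round(width_in * dpi) * round(height_in * dpi) * channels


for dpi in (200, 300, 600):
    megabytes = raster_bytes(24, 36, dpi) / 1e6
    print(f"{dpi} DPI: {megabytes:.1f} MB")
# 200 DPI: 34.6 MB
# 300 DPI: 77.8 MB
# 600 DPI: 311.0 MB
```

These figures are for a single grayscale channel. An RGB pixmap held in memory before conversion is three times larger, and each intermediate image in the pipeline adds another copy.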

How do the confidence thresholds map to routing states?

A page-quality confidence of 0.92 or higher auto-routes straight to field extraction; a score of at least 0.75 but below 0.92 diverts to a human-review queue; and anything below 0.75 quarantines the page for re-capture or manual transcription. These bands are identical across the ingestion pipeline, so a page’s fate is predictable regardless of which subsystem scored it.

Why is preprocessing idempotent, and how is that enforced?

Site connectivity is unreliable and brokers redeliver messages, so the same page is reprocessed often. Keying every output by source_hash and page_index and caching the cleaned TIFF means a retry returns the existing result instead of re-running expensive OCR, which keeps throughput predictable and prevents duplicate review-queue entries.

Does the Laplacian variance equal real OCR accuracy?

No — it is a cheap sharpness proxy that correlates with downstream OCR success but does not measure it directly. Use it to triage pages into the routing bands, then refine the score with the actual per-region OCR confidence once Tesseract has run. Calibrate the normalization divisor against your own scanner fleet rather than trusting a single constant.
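One possible way to refine the score is to blend the sharpness proxy with the mean word confidence from the OCR pass, such as the conf values that pytesseract.image_to_data reports. The sketch below is an assumption, not the pipeline's scoring rule: refine_confidence and the 0.7 weight are hypothetical, and the function takes plain numbers so that it does not depend on Tesseract being installed.

```python
def refine_confidence(
    sharpness: float, word_confidences: list[float], ocr_weight: float = 0.7
) -> float:
    """Blend the sharpness proxy with the mean OCR word confidence.

    word_confidences use Tesseract's 0-100 scale; -1 marks rows that are
    not words and is ignored. ocr_weight is a hypothetical blending factor.
    """
    words = [c for c in word_confidences if c >= 0]
    if not words:
        return sharpness  # nothing recognized yet, so keep the proxy as-is
    ocr_score = sum(words) / len(words) / 100.0
    return round((1.0 - ocr_weight) * sharpness + ocr_weight * ocr_score, 4)
```

For example, a page with a sharpness proxy of 1.0 whose words average 85 refines to about 0.895, which moves it from auto-route into human review.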


For parameter tuning across varying scanner hardware, see OpenCV’s adaptive thresholding documentation and the PyMuPDF rendering API.
