OCR on statements that do not cooperate
Every BSA vendor works on a clean HDFC NetBanking export. Lender accuracy — the number that determines whether the underwriting pipeline keeps running — is set by what happens on the dot-matrix PSU scan, the Karnataka State Co-op statement, and the multi-generation photocopy retrieved from branch archives. TransactIQ is engineered for those inputs, not against them.
The five hardest input classes
These are the statements where incumbent OCR routinely drops below usable accuracy. NBFC underwriting teams end up re-keying them manually, which is slow, expensive, and introduces its own error class.
Dot-matrix PSU bank scans
A branch prints the statement on a dot-matrix impact printer, scans the page on a flatbed MFP at 150 DPI, and emails the PDF. The glyph shapes are dotted rather than continuous, and the columns drift out of alignment after the second page. Generic OCR models read these as noise.
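A standard pre-processing move for dotted glyphs is morphological dilation, which merges the isolated printer dots into continuous strokes before character recognition runs. The sketch below is illustrative only (a toy pure-Python dilation on a tiny bitmap, not TransactIQ's pipeline), but it shows why the step matters:

```python
def dilate(grid, radius=1):
    """Morphological dilation: a cell becomes ink if any neighbour
    within `radius` is ink. This merges dot-matrix dots into the
    continuous strokes an OCR model expects."""
    rows, cols = len(grid), len(grid[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols and grid[rr][cc]:
                        out[r][c] = 1
    return out

# A dotted vertical stroke, as an impact printer would leave it.
dotted = [
    [1, 0],
    [0, 0],
    [1, 0],
    [0, 0],
    [1, 0],
]
solid = dilate(dotted)
print([row[0] for row in solid])  # → [1, 1, 1, 1, 1] — gaps filled
```

In production this would be done on the raster image with a vision library; the principle is the same.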
Cooperative bank formats
Urban and district central co-operative banks often export statements from accounting packages that were never standardised across the sector. Column labels are in Kannada, Marathi, or Tamil; dates are ambiguous between DD/MM and MM/DD; and the running balance appears in a fourth column whose position varies by software vendor.
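The DD/MM versus MM/DD ambiguity can often be resolved from the ledger itself: statement rows are chronological, so only one reading keeps the dates in order. A minimal sketch of that heuristic (the `disambiguate` helper is hypothetical, not a TransactIQ API):

```python
from datetime import date

def disambiguate(date_strings):
    """Decide whether ambiguous 'a/b/yyyy' strings are DD/MM or MM/DD
    by checking which reading keeps the ledger chronological.
    Returns 'DD/MM', 'MM/DD', or 'ambiguous'."""
    def parse(s, day_first):
        a, b, y = (int(x) for x in s.split("/"))
        d, m = (a, b) if day_first else (b, a)
        return date(y, m, d)  # raises ValueError on an impossible date

    verdicts = {}
    for fmt, day_first in (("DD/MM", True), ("MM/DD", False)):
        try:
            parsed = [parse(s, day_first) for s in date_strings]
            verdicts[fmt] = parsed == sorted(parsed)
        except ValueError:
            verdicts[fmt] = False  # e.g. month 13 rules the reading out
    valid = [fmt for fmt, ok in verdicts.items() if ok]
    return valid[0] if len(valid) == 1 else "ambiguous"

# '13/04/2024' can only be 13 April, so only DD/MM is consistent.
print(disambiguate(["02/04/2024", "13/04/2024", "21/04/2024"]))  # → DD/MM
```

When both readings survive (all day values ≤ 12 and both orderings chronological), the file is flagged rather than guessed.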
Password-protected PDFs
Many private-bank exports ship password-protected, with the password usually derived from some combination of the borrower's PAN and date of birth. The borrower must share the password for the analysis to proceed. TransactIQ handles the decryption path inside the VPC; the password never lands in a message log.
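Because each bank composes the password differently, a decryption path typically tries a small set of candidate patterns built from the PAN and date of birth. The patterns below are illustrative only (real schemes vary per bank, and `password_candidates` is a hypothetical helper, not TransactIQ's implementation):

```python
from datetime import date

def password_candidates(pan, dob):
    """Generate plausible statement-PDF passwords from PAN and date
    of birth. The exact scheme varies per bank; these four patterns
    are illustrative, not any specific bank's format."""
    ddmm = dob.strftime("%d%m")
    ddmmyyyy = dob.strftime("%d%m%Y")
    return [
        pan[:4].upper() + ddmm,       # first 4 of PAN + DDMM
        pan.upper() + ddmm,           # full PAN + DDMM
        ddmmyyyy,                     # DDMMYYYY alone
        pan[-4:].upper() + ddmmyyyy,  # last 4 of PAN + DDMMYYYY
    ]

for pw in password_candidates("ABCDE1234F", date(1990, 7, 15)):
    print(pw)
```

The candidate list is tried against the encrypted PDF inside the VPC and discarded immediately afterwards; only the decrypted document flows downstream.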
Multi-column and multi-bank PDFs
A consolidated CA-generated PDF can stack three bank statements in one file, each with its own column order. Date columns interleave. Opening balance on page 7 refers to account 3, not account 1. Generic BSA vendors either reject the file or extract it incorrectly.
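Handling a consolidated file starts with segmenting pages into per-account runs before any ledger parsing, so that an opening balance on page 7 attaches to the right account. A simplified sketch of that grouping step (the tuple shape is an assumption about what a layout parser might emit, not TransactIQ's internal schema):

```python
def split_by_account(pages):
    """Group consecutive pages into per-account segments. A page that
    declares a new account number starts a new segment; pages with no
    header (None) continue the current one. `pages` is a list of
    (account_no_or_None, page_text) tuples. Assumes the first page
    carries an account header."""
    segments = []
    current_account = None
    for account_no, text in pages:
        if account_no is not None and account_no != current_account:
            current_account = account_no
            segments.append((account_no, []))
        segments[-1][1].append(text)
    return segments

pages = [
    ("XX1111", "p1"), (None, "p2"),   # account 1, two pages
    ("XX2222", "p3"),                 # account 2, one page
    ("XX3333", "p4"), (None, "p5"),   # account 3, two pages
]
print([(acct, len(pgs)) for acct, pgs in split_by_account(pages)])
# → [('XX1111', 2), ('XX2222', 1), ('XX3333', 2)]
```

With the pages segmented, each run can be parsed with its own column order rather than forcing one layout onto the whole file.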
Fax-origin and photocopy noise
Compliance-archived statements retrieved from branch warehouses are often multi-generation photocopies — text is legible to a human but below the threshold of structural recognition for most OCR models. TransactIQ is benchmarked specifically on this input class.
How TransactIQ approaches the problem
Four design decisions that distinguish TransactIQ's OCR layer from the generic stack a BSA vendor can buy off the shelf.
Trained on degraded inputs, not clean ones
The OCR stack's training and benchmarking corpus is weighted toward the hard cases described above — because those are the statements where lender accuracy actually breaks. A vendor that trains mostly on HDFC NetBanking PDFs will report impressive average accuracy and still fail the 20% of inputs that matter.
Structure-aware, not glyph-only
OCR is only part of the problem; recovering the ledger structure (date / narration / debit / credit / running balance) across unusual layouts is the other part. TransactIQ uses structure-recognition models alongside character recognition so that misaligned columns and split-cell layouts still produce a well-formed ledger.
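One way structure recovery tolerates column drift is to assign each recognised token to the nearest column header by horizontal position rather than by a fixed grid. A minimal sketch of that idea, assuming the layout stage emits (x-position, text) pairs (this is an illustration of the technique, not TransactIQ's model):

```python
def assign_columns(tokens, headers):
    """Assign each token to the ledger column whose header x-position
    is nearest. `tokens` and `headers` are (x, text) pairs. Nearest-
    header matching absorbs the column drift common in scans."""
    columns = {}
    for x, text in tokens:
        col = min(headers, key=lambda h: abs(h[0] - x))[1]
        columns.setdefault(col, []).append(text)
    return columns

headers = [(10, "date"), (120, "narration"), (300, "debit"), (380, "credit")]
tokens = [(12, "01/04/2024"), (118, "NEFT"), (125, "ACME"), (305, "2,500.00")]
print(assign_columns(tokens, headers))
# → {'date': ['01/04/2024'], 'narration': ['NEFT', 'ACME'], 'debit': ['2,500.00']}
```

Tokens that drift a few pixels, or a narration split across two cells, still land in the right column of the reconstructed ledger.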
Post-OCR reconciliation
Every processed statement is self-checked by reconciling line-level debit/credit against running balance, flagging cells where extraction is suspect. Flagged cells are re-processed or surfaced to the lender's workflow — they do not silently enter the credit signal pipeline.
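The reconciliation invariant itself is simple: each row's stated balance must equal the previous balance minus the debit plus the credit, and any row that breaks the identity is flagged. A minimal sketch of that check, with amounts held in paise as integers to avoid float error (an illustration of the invariant, not TransactIQ's code):

```python
def reconcile(opening_balance, rows):
    """Check each extracted row against the running balance:
    expected = previous balance - debit + credit. Returns the indices
    of rows that fail the identity, so they can be re-processed or
    surfaced instead of entering the credit signal pipeline.
    Amounts are paise (integers) to avoid float error."""
    flagged = []
    balance = opening_balance
    for i, (debit, credit, stated_balance) in enumerate(rows):
        expected = balance - debit + credit
        if expected != stated_balance:
            flagged.append(i)
        balance = stated_balance  # continue from the stated figure
    return flagged

rows = [
    (50_000, 0, 950_000),  # debit 500.00  -> 9,500.00  reconciles
    (0, 20_000, 970_000),  # credit 200.00 -> 9,700.00  reconciles
    (10_000, 0, 990_000),  # expected 9,600.00, stated 9,900.00: suspect
]
print(reconcile(1_000_000, rows))  # → [2]
```

A flagged index means at least one of the four cells in that row (debit, credit, or either balance) was misread, which localises re-processing to a single row instead of the whole page.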
Continuous benchmarking
Accuracy is reported per bank category and per statement type on a rolling basis — not a static marketing number. Early-access partners receive the quarterly benchmark report covering the banks most relevant to their portfolio.
Test TransactIQ on your hardest statements
Early-access lenders can submit a representative sample from their portfolio and receive a bank-category accuracy report before production integration.
Request accuracy benchmark