OCR on statements that do not cooperate
Every BSA vendor works on a clean HDFC NetBanking export. Lender accuracy — the number that determines whether the underwriting pipeline keeps running — is set by what happens on the dot-matrix PSU scan, the Karnataka State Co-op statement, and the multi-generation photocopy retrieved from branch archives. TransactIQ is engineered for those inputs, not against them.
The five hardest input classes
These are the statements where incumbent OCR routinely drops below usable accuracy. NBFC underwriting teams end up re-keying them manually, which is slow, expensive, and introduces its own error class.
Dot-matrix PSU bank scans
A branch prints the statement on a dot-matrix impact printer, scans the page on a flatbed MFP at 150 DPI, and emails the PDF. The glyph shapes are dotted rather than continuous, and the columns drift out of alignment after the second page. Generic OCR models read these as noise.
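A standard pre-processing move for dotted glyphs is morphological dilation, which merges the isolated printer dots into continuous strokes before character recognition runs. The sketch below is illustrative only (a toy pure-Python dilation on a tiny bitmap, not TransactIQ's pipeline), but it shows why the step matters:

```python
def dilate(grid, radius=1):
    """Morphological dilation: a cell becomes ink if any neighbour
    within `radius` is ink. This merges dot-matrix dots into the
    continuous strokes an OCR model expects."""
    rows, cols = len(grid), len(grid[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols and grid[rr][cc]:
                        out[r][c] = 1
    return out

# A dotted vertical stroke, as an impact printer would leave it.
dotted = [
    [1, 0],
    [0, 0],
    [1, 0],
    [0, 0],
    [1, 0],
]
solid = dilate(dotted)
print([row[0] for row in solid])  # → [1, 1, 1, 1, 1] — gaps filled
```

In production this would be done on the raster image with a vision library; the principle is the same.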
Cooperative bank formats
Urban and district central co-operative banks often export statements from accounting packages that were never standardised across the sector. Column labels are in Kannada, Marathi, or Tamil; dates are ambiguous between DD/MM and MM/DD; and the running balance appears in a fourth column whose position varies by software vendor.
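The DD/MM versus MM/DD ambiguity can often be resolved from the ledger itself: statement rows are chronological, so only one reading keeps the dates in order. A minimal sketch of that heuristic (the `disambiguate` helper is hypothetical, not a TransactIQ API):

```python
from datetime import date

def disambiguate(date_strings):
    """Decide whether ambiguous 'a/b/yyyy' strings are DD/MM or MM/DD
    by checking which reading keeps the ledger chronological.
    Returns 'DD/MM', 'MM/DD', or 'ambiguous'."""
    def parse(s, day_first):
        a, b, y = (int(x) for x in s.split("/"))
        d, m = (a, b) if day_first else (b, a)
        return date(y, m, d)  # raises ValueError on an impossible date

    verdicts = {}
    for fmt, day_first in (("DD/MM", True), ("MM/DD", False)):
        try:
            parsed = [parse(s, day_first) for s in date_strings]
            verdicts[fmt] = parsed == sorted(parsed)
        except ValueError:
            verdicts[fmt] = False  # e.g. month 13 rules the reading out
    valid = [fmt for fmt, ok in verdicts.items() if ok]
    return valid[0] if len(valid) == 1 else "ambiguous"

# '13/04/2024' can only be 13 April, so only DD/MM is consistent.
print(disambiguate(["02/04/2024", "13/04/2024", "21/04/2024"]))  # → DD/MM
```

When both readings survive (all day values ≤ 12 and both orderings chronological), the file is flagged rather than guessed.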
Password-protected PDFs
Many private-bank exports ship password-protected, with the password usually derived from some combination of the borrower's PAN and date of birth. The borrower must share the password for the analysis to proceed. TransactIQ handles the decryption path inside the VPC; the password never lands in a message log.
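Because each bank composes the password differently, a decryption path typically tries a small set of candidate patterns built from the PAN and date of birth. The patterns below are illustrative only (real schemes vary per bank, and `password_candidates` is a hypothetical helper, not TransactIQ's implementation):

```python
from datetime import date

def password_candidates(pan, dob):
    """Generate plausible statement-PDF passwords from PAN and date
    of birth. The exact scheme varies per bank; these four patterns
    are illustrative, not any specific bank's format."""
    ddmm = dob.strftime("%d%m")
    ddmmyyyy = dob.strftime("%d%m%Y")
    return [
        pan[:4].upper() + ddmm,       # first 4 of PAN + DDMM
        pan.upper() + ddmm,           # full PAN + DDMM
        ddmmyyyy,                     # DDMMYYYY alone
        pan[-4:].upper() + ddmmyyyy,  # last 4 of PAN + DDMMYYYY
    ]

for pw in password_candidates("ABCDE1234F", date(1990, 7, 15)):
    print(pw)
```

The candidate list is tried against the encrypted PDF inside the VPC and discarded immediately afterwards; only the decrypted document flows downstream.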
Multi-column and multi-bank PDFs
A consolidated CA-generated PDF can stack three bank statements in one file, each with its own column order. Date columns interleave. Opening balance on page 7 refers to account 3, not account 1. Generic BSA vendors either reject the file or extract it incorrectly.
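Handling a consolidated file starts with segmenting pages into per-account runs before any ledger parsing, so that an opening balance on page 7 attaches to the right account. A simplified sketch of that grouping step (the tuple shape is an assumption about what a layout parser might emit, not TransactIQ's internal schema):

```python
def split_by_account(pages):
    """Group consecutive pages into per-account segments. A page that
    declares a new account number starts a new segment; pages with no
    header (None) continue the current one. `pages` is a list of
    (account_no_or_None, page_text) tuples. Assumes the first page
    carries an account header."""
    segments = []
    current_account = None
    for account_no, text in pages:
        if account_no is not None and account_no != current_account:
            current_account = account_no
            segments.append((account_no, []))
        segments[-1][1].append(text)
    return segments

pages = [
    ("XX1111", "p1"), (None, "p2"),   # account 1, two pages
    ("XX2222", "p3"),                 # account 2, one page
    ("XX3333", "p4"), (None, "p5"),   # account 3, two pages
]
print([(acct, len(pgs)) for acct, pgs in split_by_account(pages)])
# → [('XX1111', 2), ('XX2222', 1), ('XX3333', 2)]
```

With the pages segmented, each run can be parsed with its own column order rather than forcing one layout onto the whole file.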
Fax-origin and photocopy noise
Compliance-archived statements retrieved from branch warehouses are often multi-generation photocopies — text is legible to a human but below the threshold of structural recognition for most OCR models. TransactIQ is benchmarked specifically on this input class.
How TransactIQ approaches the problem
Four design decisions that distinguish TransactIQ's OCR layer from the generic stack a BSA vendor can buy off the shelf.
Trained on degraded inputs, not clean ones
The OCR stack's training and benchmarking corpus is weighted toward the hard cases described above — because those are the statements where lender accuracy actually breaks. A vendor that trains mostly on HDFC NetBanking PDFs will report impressive average accuracy and still fail the 20% of inputs that matter.
Structure-aware, not glyph-only
OCR is only part of the problem; recovering the ledger structure (date / narration / debit / credit / running balance) across unusual layouts is the other part. TransactIQ uses structure-recognition models alongside character recognition so that misaligned columns and split-cell layouts still produce a well-formed ledger.
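One way structure recovery tolerates column drift is to assign each recognised token to the nearest column header by horizontal position rather than by a fixed grid. A minimal sketch of that idea, assuming the layout stage emits (x-position, text) pairs (this is an illustration of the technique, not TransactIQ's model):

```python
def assign_columns(tokens, headers):
    """Assign each token to the ledger column whose header x-position
    is nearest. `tokens` and `headers` are (x, text) pairs. Nearest-
    header matching absorbs the column drift common in scans."""
    columns = {}
    for x, text in tokens:
        col = min(headers, key=lambda h: abs(h[0] - x))[1]
        columns.setdefault(col, []).append(text)
    return columns

headers = [(10, "date"), (120, "narration"), (300, "debit"), (380, "credit")]
tokens = [(12, "01/04/2024"), (118, "NEFT"), (125, "ACME"), (305, "2,500.00")]
print(assign_columns(tokens, headers))
# → {'date': ['01/04/2024'], 'narration': ['NEFT', 'ACME'], 'debit': ['2,500.00']}
```

Tokens that drift a few pixels, or a narration split across two cells, still land in the right column of the reconstructed ledger.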
Post-OCR reconciliation
Every processed statement is self-checked by reconciling line-level debit/credit against running balance, flagging cells where extraction is suspect. Flagged cells are re-processed or surfaced to the lender's workflow — they do not silently enter the credit signal pipeline.
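The reconciliation invariant itself is simple: each row's stated balance must equal the previous balance minus the debit plus the credit, and any row that breaks the identity is flagged. A minimal sketch of that check, with amounts held in paise as integers to avoid float error (an illustration of the invariant, not TransactIQ's code):

```python
def reconcile(opening_balance, rows):
    """Check each extracted row against the running balance:
    expected = previous balance - debit + credit. Returns the indices
    of rows that fail the identity, so they can be re-processed or
    surfaced instead of entering the credit signal pipeline.
    Amounts are paise (integers) to avoid float error."""
    flagged = []
    balance = opening_balance
    for i, (debit, credit, stated_balance) in enumerate(rows):
        expected = balance - debit + credit
        if expected != stated_balance:
            flagged.append(i)
        balance = stated_balance  # continue from the stated figure
    return flagged

rows = [
    (50_000, 0, 950_000),  # debit 500.00  -> 9,500.00  reconciles
    (0, 20_000, 970_000),  # credit 200.00 -> 9,700.00  reconciles
    (10_000, 0, 990_000),  # expected 9,600.00, stated 9,900.00: suspect
]
print(reconcile(1_000_000, rows))  # → [2]
```

A flagged index means at least one of the four cells in that row (debit, credit, or either balance) was misread, which localises re-processing to a single row instead of the whole page.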
Continuous benchmarking
Accuracy is reported per bank category and per statement type on a rolling basis — not a static marketing number. Early-access partners receive the quarterly benchmark report covering the banks most relevant to their portfolio.
Test TransactIQ on your hardest statements
Early-access lenders can submit a representative sample from their portfolio and receive a bank-category accuracy report before production integration.
Request accuracy benchmark