
Bank Statement Analyzer Accuracy Benchmark: A Bake-Off Framework for Lenders

Every BSA vendor claims 95%+ accuracy. On what statement mix, measured how, against which corpus? This framework normalises competing accuracy claims with a selection-bias penalty, flags coverage gaps against your portfolio, and produces a 10-item bake-off checklist you can run before the contract lands on legal.

How this works

Step 1

Describe your portfolio mix

Statement source mix and the edge formats your underwriting actually sees. Determines what coverage a vendor genuinely needs to support you.

Step 2

Enter the three vendor claims

Claimed accuracy, claimed bank count, and the test data source for each. The test data source drives the bias penalty.

Step 3

Read the normalised scores

Apples-to-apples normalised accuracy, coverage gap flags, and a 10-item bake-off checklist for the diligence pack.

Your portfolio profile

Monthly statement mix
Statement format types you see

Distinct banks your loan book draws statements from across all geographies.

Vendor claims

Vendor A
Vendor B
Vendor C

Normalised comparison

Vendor A
Normalised accuracy
80.8%
Claimed 95% less a 15% relative bias penalty (95% × 0.85)
Unstated source
Bank count below your need
Vendor B
Normalised accuracy
80.8%
Claimed 95% less a 15% relative bias penalty (95% × 0.85)
Unstated source
Bank count below your need
Vendor C
Normalised accuracy
80.8%
Claimed 95% less a 15% relative bias penalty (95% × 0.85)
Unstated source
Bank count below your need

How to read the output

The normalised score is the vendor's claim discounted for selection bias. The discount is relative: a 15% discount multiplies the claim by 0.85, so a claimed 95% becomes about 80.8%. A claim measured on the vendor's own corpus carries a 15% discount because that corpus is almost always the one the parser was tuned on. A public benchmark carries a 5% discount because public benchmarks still have selection effects, just smaller ones. A customer-supplied corpus carries no discount; it is the only test that matters. An unstated source is treated as the worst case, a 15% discount, until proven otherwise.
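The discount rule can be sketched in a few lines. This is a minimal illustration, not the tool's actual code: the function name and the source keys are hypothetical, and the penalty is assumed to be multiplicative because a claimed 95% with a 15% penalty is displayed as 80.8% rather than 80%.

```python
# Relative discounts by test-data source, taken from the description above.
# The dictionary keys are hypothetical labels chosen for this sketch.
BIAS_PENALTY = {
    "vendor_corpus": 0.15,
    "public_benchmark": 0.05,
    "customer_corpus": 0.00,
    "unstated": 0.15,  # treated as the worst case until proven otherwise
}


def normalised_accuracy(claimed_pct: float, source: str) -> float:
    """Apply the relative bias penalty to a claimed accuracy percentage."""
    # Any unrecognised source falls back to the worst-case penalty.
    penalty = BIAS_PENALTY.get(source, BIAS_PENALTY["unstated"])
    return claimed_pct * (1.0 - penalty)


print(f"{normalised_accuracy(95.0, 'unstated'):.2f}")          # 80.75
print(f"{normalised_accuracy(95.0, 'public_benchmark'):.2f}")  # 90.25
```

A vendor that agrees to be measured on your own corpus keeps its full claimed number, which is the incentive the scoring is meant to create.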

The coverage gap flag fires when the claimed bank count is below the bank count your portfolio actually needs. This is the silent failure mode in vendor demos — the demo runs on 50 banks the vendor curated, your production runs on 150 banks that include the long tail of co-operatives and regional rural banks.
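As a rough sketch, the flag is a count comparison. The second helper is an assumption that goes one step beyond the count: if the vendor will share its supported-bank list, a set difference against your own bank list shows which banks are missing, not just how many. All names below are hypothetical.

```python
def coverage_gap(claimed_banks: int, required_banks: int) -> int:
    """Return how many banks short the claim is; 0 means no flag."""
    return max(0, required_banks - claimed_banks)


def missing_banks(vendor_supported: set, portfolio_banks: set) -> set:
    """Banks in your portfolio that the vendor does not list as supported."""
    return portfolio_banks - vendor_supported


print(coverage_gap(claimed_banks=50, required_banks=150))  # 100

# Illustrative lists only; "Example Co-op Bank" is a placeholder name.
vendor = {"SBI", "PNB", "Canara"}
portfolio = {"SBI", "PNB", "Example Co-op Bank"}
print(missing_banks(vendor, portfolio))  # {'Example Co-op Bank'}
```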

The bake-off checklist below the comparison is the actual diligence work. The normalised score gets you to a shortlist. The checklist gets you to a contract you will not regret in 12 months.

Things to actually test

Run every shortlisted vendor against this list on your own corpus. The vendor that ships demos passing items 1, 2, and 3 is the one to take seriously.

  1. Degraded scan accuracy: your own corpus of 100+ scanned PDFs at 200 dpi or below
  2. PSU dot-matrix coverage: SBI, BoB, PNB, Canara, Union Bank pre-2018 formats
  3. Password-protected statement handling: without logging the password anywhere recoverable
  4. Co-op bank UTR extraction: the format that most often silently drops UTRs
  5. 50-page statement performance: latency, memory, and accuracy on long statements
  6. Multi-account aggregation: same borrower, multiple banks, single normalised output
  7. Signal coverage: does the analyzer produce 40+ engineered credit signals, or just 6 to 10?
  8. AA payload normalisation parity: same borrower on AA vs PDF produces the same score
  9. Real customer corpus test: 100+ statements you have actually underwritten, not vendor samples
  10. API latency benchmark: p50, p95, p99 with concurrent calls at your production volume
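For item 10, a minimal harness might look like the following. It is a sketch under stated assumptions: `call` stands in for whatever client function submits one statement to a vendor API (hypothetical here), and a thread pool approximates concurrent load from a single machine.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def latency_percentiles(samples_ms):
    """Return p50, p95 and p99 from per-call latencies in milliseconds."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def timed_call(call, payload):
    """Run one call and return its wall-clock duration in milliseconds."""
    start = time.perf_counter()
    call(payload)
    return (time.perf_counter() - start) * 1000.0


def benchmark(call, payloads, concurrency):
    """Issue all payloads with `concurrency` worker threads and summarise."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        samples = list(pool.map(lambda p: timed_call(call, p), payloads))
    return latency_percentiles(samples)
```

Set `concurrency` to the number of simultaneous calls you expect at production volume; a single-threaded run will flatter every vendor.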

Why vendor accuracy claims are unreliable

Every BSA vendor pitch deck cites an accuracy number in the high nineties. The number is rarely a lie — it is usually defensible on the corpus it was measured on. The problem is that the corpus is not your portfolio. The accuracy gap between a vendor's headline number and what you will see in production at the end of month one is consistently 8 to 15 percentage points, sometimes more on PSU-heavy or co-op-heavy books.

Four common claim-inflation patterns recur in lender BSA diligence. The first is corpus selection bias — the vendor tests on the statements they have tuned for, which over-represents native private-bank PDFs. The second is metric definition drift — field-level accuracy and transaction-level accuracy and statement-level accuracy are all "accuracy", but they can differ by 10 points on the same corpus, and the vendor will cite whichever is highest. The third is freshness drift — an accuracy number from a 2024 benchmark is stale because every major bank has changed at least one statement layout since. The fourth is the segment-vs-aggregate trick — "95% accuracy" is the average across segments where one segment (private banks) is at 98% and another (co-op banks) is at 76%.
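The segment-vs-aggregate trick is easy to reproduce with a weighted average. The sketch below uses the 98% and 76% segment figures from the paragraph above; both corpus mixes are hypothetical and chosen only to show how the same parser yields different headline numbers.

```python
def aggregate_accuracy(segment_accuracy, segment_weight):
    """Weighted average of per-segment accuracy (weights need not sum to 1)."""
    total = sum(segment_weight.values())
    weighted = sum(segment_accuracy[s] * w for s, w in segment_weight.items())
    return weighted / total


segment_accuracy = {"private": 98.0, "co_op": 76.0}

# Hypothetical mixes: a vendor test corpus dominated by private banks,
# and a co-op-heavy lender portfolio.
vendor_mix = {"private": 19, "co_op": 3}
lender_mix = {"private": 40, "co_op": 60}

print(f"{aggregate_accuracy(segment_accuracy, vendor_mix):.1f}")  # 95.0
print(f"{aggregate_accuracy(segment_accuracy, lender_mix):.1f}")  # 84.8
```

Nothing about the parser changed between the two lines of output. Only the weights did, which is why per-segment numbers are the ones to ask for.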

The only test that matters is your own corpus. Assemble 500 to 2,000 statements that mirror your portfolio distribution, mask the PII centrally, define a single accuracy metric, and run every shortlisted vendor through the same set. The bake-off can be done in three to four weeks if the procurement clock is moving. The lenders that do this end up with BSAs that hold their accuracy claim through production scaling. The lenders that skip it end up with a different problem six months later.
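One way to draw a corpus that mirrors your portfolio is a stratified random sample. The sketch below assumes each statement record carries a `segment` label and that you know the target share of each segment; both the record shape and the function name are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(statements, target_share, corpus_size, seed=0):
    """Sample statements so each segment matches its target share.

    statements: list of dicts, each with a 'segment' key.
    target_share: mapping of segment -> fraction of the corpus.
    """
    rng = random.Random(seed)  # fixed seed keeps the corpus reproducible
    by_segment = defaultdict(list)
    for statement in statements:
        by_segment[statement["segment"]].append(statement)

    sample = []
    for segment, share in target_share.items():
        want = round(corpus_size * share)
        pool = by_segment[segment]
        if len(pool) < want:
            raise ValueError(f"only {len(pool)} statements for {segment}, need {want}")
        sample.extend(rng.sample(pool, want))
    return sample
```

Failing loudly when a segment is short is deliberate: quietly topping up with another segment would recreate the selection bias the bake-off is meant to remove.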

TransactIQ ships the accuracy posture transparently — 200+ banks supported across private, PSU, co-operative, and regional rural; ISO 27001:2022 certified; deployed on AWS Mumbai; 40+ engineered credit signals; four-layer MSME synthetic financials. The 51% to 88% match rate improvement is from a live customer deployment and is the number we will defend on your corpus.

Related

Product

TransactIQ

Bank statement intelligence and analyzer API for NBFC underwriting.

Coverage

Bank coverage

200+ banks across private, PSU, co-operative, and regional rural.

Architecture

Architecture posture

Deployment, security, audit trail, latency profile.

Tool

BSA Build vs Buy Calculator

Three-year TCO of building an in-house BSA versus licensing one.

Frequently Asked Questions

Why are vendor accuracy claims usually inflated?

Four reasons. First, selection bias on the test corpus — vendors test on the statements their parser was tuned for, which over-represents clean private-bank native PDFs and under-represents the degraded PSU and co-operative bank statements that actually fail in production. Second, the accuracy metric is rarely defined the same way — field-level, transaction-level, statement-level, and end-to-end accuracy can differ by 8 to 15 percentage points on the same corpus. Third, the public benchmark used (when one is cited) is usually a single-distribution dataset that does not reflect lender portfolio mix. Fourth, the corpus size cited is sometimes too small to be statistically meaningful — a claim made on 200 statements has wide confidence intervals.

What is a realistic accuracy on PSU dot-matrix statements?

On legacy PSU dot-matrix statements — printed from pre-2015 core banking and scanned to PDF — production-grade extractors typically operate at 72 to 85% field-level accuracy, with substantial variance across SBI, Bank of Baroda, PNB, and the smaller PSU banks. The bottleneck is OCR character recognition on degraded scans where ink density varies line-to-line. Claims above 90% on this segment should trigger an immediate test on your own corpus — they are achievable on a curated sub-segment but very rarely on the full PSU dot-matrix population.

How do I run an apples-to-apples bake-off?

Assemble a corpus from your own production statements (not vendor-supplied) that mirrors your portfolio distribution — same share of private, PSU, co-operative, and AA payloads, same share of scanned vs native, same share of password-protected, same share of multi-account aggregation. Mask PII centrally. Define a single accuracy metric ahead of time and write down what counts as a correct extraction. Run all vendors on the same corpus. Measure end-to-end accuracy (extraction + categorisation + signal output), not just transaction extraction. Track latency at concurrent-call volume, not at single-statement volume. Score on the segments you actually lend to, not the aggregate.
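Writing the metric down as code before the bake-off removes most of the room for argument. The sketch below computes field-level, transaction-level, and statement-level accuracy on the same data. It assumes a hypothetical three-field schema and that extracted transactions align with ground truth by position, which is a simplification.

```python
import copy

FIELDS = ("date", "amount", "description")  # hypothetical field set


def accuracy_levels(truth, extracted):
    """truth and extracted are lists of statements; a statement is a list of dicts."""
    field_ok = field_n = txn_ok = txn_n = stmt_ok = 0
    for t_stmt, e_stmt in zip(truth, extracted):
        # A statement with missing or extra transactions is never fully correct.
        stmt_correct = len(t_stmt) == len(e_stmt)
        for i, t_txn in enumerate(t_stmt):
            # A dropped transaction counts as every field wrong.
            e_txn = e_stmt[i] if i < len(e_stmt) else {}
            matches = [e_txn.get(f) == t_txn[f] for f in FIELDS]
            field_ok += sum(matches)
            field_n += len(FIELDS)
            txn_ok += all(matches)
            txn_n += 1
            stmt_correct = stmt_correct and all(matches)
        stmt_ok += stmt_correct
    return {
        "field": field_ok / field_n,
        "transaction": txn_ok / txn_n,
        "statement": stmt_ok / len(truth),
    }


txn_a = {"date": "2024-01-05", "amount": 1200.0, "description": "NEFT SALARY"}
txn_b = {"date": "2024-01-09", "amount": -450.0, "description": "UPI GROCERY"}
truth = [[dict(txn_a), dict(txn_b)], [dict(txn_a), dict(txn_b)]]
extracted = copy.deepcopy(truth)
extracted[0][1]["amount"] = -45.0  # a single misread amount

for level, value in accuracy_levels(truth, extracted).items():
    print(f"{level}: {value:.1%}")
# field: 91.7%
# transaction: 75.0%
# statement: 50.0%
```

One wrong field out of twelve produces three different "accuracy" numbers, so the metric has to be fixed before any vendor sees the corpus.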

What is the minimum corpus size for a real benchmark? +

Five hundred statements is the floor for a defensible per-segment accuracy claim with reasonable confidence intervals. One thousand to two thousand is the working standard for vendor diligence at a serious lender. Below 200, the confidence intervals are wide enough that two vendors with apparently different accuracy may not be statistically distinguishable. The corpus should also be stratified — at least 100 statements per major bank category (private, PSU, co-op, RRB) and at least 50 each on edge formats (password-protected, dot-matrix, multi-account).
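To see why small corpora cannot separate vendors, compute a confidence interval on each measured accuracy. The sketch below uses the Wilson score interval at 95% confidence, which is one common choice and an assumption of this example rather than something the framework prescribes. The vendor counts are illustrative.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95%."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


# Two hypothetical vendors at 90% and 93% on 200 statements.
low_a, high_a = wilson_interval(180, 200)
low_b, high_b = wilson_interval(186, 200)
print(low_b < high_a)  # True: the intervals overlap at n=200

# The same accuracies on 2,000 statements.
low_a, high_a = wilson_interval(1800, 2000)
low_b, high_b = wilson_interval(1860, 2000)
print(low_b < high_a)  # False: the intervals separate at n=2,000
```

Overlapping intervals are only a rough screen, but they are enough to show that a three-point gap on 200 statements should not decide a contract.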

Why does accuracy differ between AA and PDF?

Account Aggregator payloads arrive as structured JSON with normalised transaction records, so the extraction problem is essentially solved at the source — accuracy on AA is usually above 99% on the extraction step. The accuracy claim that matters on AA is the normalisation parity question: does the analyzer produce the same categorisation, signal output, and credit score on a borrower's AA payload as it does on the borrower's PDF statement? If a borrower comes through AA on one application and PDF on the next, the underwriting decision should not change because the channel changed. Many analyzers have a 4 to 8 point accuracy gap between PDF and AA on the same borrower.
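A parity check can be sketched as a comparison of the two outputs for the same borrower. The output shape below (a score, categories keyed by transaction ID, and named signals) is hypothetical, as is the assumption that transactions can be matched across channels by a shared ID.

```python
def parity_report(aa_output, pdf_output):
    """Compare analyzer outputs for one borrower across the AA and PDF channels."""
    aa_cat, pdf_cat = aa_output["categories"], pdf_output["categories"]
    shared = aa_cat.keys() & pdf_cat.keys()
    agree = sum(aa_cat[k] == pdf_cat[k] for k in shared)

    aa_sig, pdf_sig = aa_output["signals"], pdf_output["signals"]
    names = aa_sig.keys() | pdf_sig.keys()
    differing = sorted(n for n in names if aa_sig.get(n) != pdf_sig.get(n))

    return {
        "category_agreement": agree / len(shared) if shared else 0.0,
        "unmatched_transactions": len(aa_cat.keys() ^ pdf_cat.keys()),
        "differing_signals": differing,
        "score_delta": pdf_output["score"] - aa_output["score"],
    }


# Illustrative outputs only; the values are made up for the example.
aa = {
    "score": 712,
    "categories": {"t1": "salary", "t2": "emi", "t3": "utilities", "t4": "transfer"},
    "signals": {"avg_balance": 41000, "bounce_count": 0},
}
pdf = {
    "score": 704,
    "categories": {"t1": "salary", "t2": "emi", "t3": "other", "t4": "transfer"},
    "signals": {"avg_balance": 41000, "bounce_count": 1},
}
print(parity_report(aa, pdf))
# {'category_agreement': 0.75, 'unmatched_transactions': 0,
#  'differing_signals': ['bounce_count'], 'score_delta': -8}
```

Run it over every borrower in the corpus who has both an AA payload and a PDF, and look at the distribution of `score_delta`, not just the mean.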

Test TransactIQ on your own corpus

We will defend the 51% to 88% match rate improvement on a corpus you supply. ISO 27001:2022, AWS Mumbai, 200+ banks, four-layer MSME synthetic financials.
