Monday, February 10, 2025

Parsing PDF Documents at Scale

Abdellatif Abdelfattah
KnowledgeProduct

PDFs are everywhere in business, but they're notoriously difficult to work with. When you need to process thousands of them for an AI system, those difficulties compound quickly.

We've processed over 3 million PDFs across various document types—invoices, contracts, research papers, and forms. Here's what we learned about doing it at scale.

Why PDFs Are Hard

PDFs were designed for printing, not data extraction. That single design decision creates cascading problems:

  • Text might be images, not selectable text
  • Tables aren't really tables—they're just text positioned to look like tables
  • Reading order isn't always obvious
  • Pages can mix multiple columns, sidebars, and nested layouts
  • There's no standard structure across different PDFs

Each document type presents unique challenges. We've seen invoices with 200+ different layouts, contracts that embed critical terms in nested tables, and scanned documents where the same vendor sends varying quality scans week to week.

The Real Challenge: Scale

Processing one PDF manually? Easy. Processing 10,000 PDFs automatically? That's where things get interesting.

At scale, you need:

  • Speed: Minutes per document add up fast. When we started, our average was 3 minutes per invoice, which works out to 500 hours for 10,000 documents. We now process the same batch in under 8 hours.
  • Consistency: The same quality across all documents. Early on, our extraction accuracy varied from 95% on clean PDFs to 60% on scanned ones.
  • Error Handling: Graceful failures for problematic files. About 3-5% of documents in any batch will have issues—corruption, encryption, unusual formats.
  • Quality Control: Knowing when extraction worked and when it didn't. We learned this the hard way after processing 50,000 documents before realizing a parser change broke date extraction.

Building a Processing Pipeline

A scalable PDF processing system needs several stages working together:

1. Document Triage

Before diving into processing, figure out what you're dealing with:

  • Is this a scanned image or digital text?
  • What type of document is it (invoice, contract, report)?
  • How many pages?
  • Is it password-protected or corrupted?

This initial classification determines which processing strategy to use. We route documents into different pipelines based on type—what works for invoices fails spectacularly on legal contracts.

In our experience, roughly 60% of business documents are digital PDFs, 30% are scanned, and 10% are mixed. Your mileage will vary.
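
To make that concrete, here's a minimal triage sketch using pypdf and pdfplumber. The library choices, the three-page sample, and the 100-characters-per-page cutoff are illustrative, not a description of any particular production setup:

```python
import pdfplumber
from pypdf import PdfReader
from pypdf.errors import PdfReadError

def triage(path: str) -> dict:
    """Classify a PDF before routing it to an extraction pipeline."""
    info = {"path": path, "status": "ok"}
    try:
        reader = PdfReader(path)
    except (PdfReadError, OSError):
        return {**info, "status": "corrupted"}

    if reader.is_encrypted:
        return {**info, "status": "encrypted"}

    info["pages"] = len(reader.pages)

    # Heuristic: if the first few pages yield almost no selectable text,
    # treat the whole document as scanned and send it to the OCR pipeline.
    with pdfplumber.open(path) as pdf:
        sample = pdf.pages[:3]
        chars = sum(len(page.extract_text() or "") for page in sample)
    info["kind"] = "digital" if chars > 100 * max(len(sample), 1) else "scanned"
    return info
```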

2. Text Extraction

Different documents need different approaches:

Digital PDFs with selectable text are straightforward. We use PDFPlumber for most cases, which handles layout preservation well. Extraction accuracy typically exceeds 98% for clean digital PDFs.
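
The happy path with PDFPlumber is only a few lines (a minimal sketch; a real pipeline wraps this in the error handling and caching discussed later):

```python
import pdfplumber

def extract_digital_text(path: str) -> str:
    """Pull the text layer from a born-digital PDF, page by page."""
    pages = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            # extract_text() returns None for pages without a text layer
            pages.append(page.extract_text() or "")
    return "\n\n".join(pages)
```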

Scanned PDFs require OCR (Optical Character Recognition). The quality of your OCR directly impacts everything downstream. We've tested Tesseract, Google Cloud Vision, and AWS Textract extensively. Google Cloud Vision gives us the best accuracy (around 92-94% on typical business documents) but costs more. Tesseract is free but averages 85-88% accuracy without preprocessing.

Sometimes basic preprocessing—like straightening skewed pages or adjusting contrast—makes a huge difference. We saw a 12% accuracy improvement on one batch of contracts just by adding deskewing.

Mixed PDFs contain both digital text and scanned images. You need to detect which is which and apply the appropriate method to each section. These are the trickiest—about 15% of our error cases come from mixed PDFs where we chose the wrong extraction method.
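
A simple per-page router can make that call. The sketch below pairs PDFPlumber with Tesseract via pytesseract; the 300 DPI rasterization, the autocontrast step, and the 100-character cutoff are illustrative values, not tuned settings:

```python
import pdfplumber
import pytesseract
from PIL import ImageOps

def extract_pages(path: str, min_chars: int = 100) -> list[str]:
    """Per-page routing: use the embedded text layer when present, OCR otherwise."""
    texts = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            text = page.extract_text() or ""
            if len(text) < min_chars:
                # Little or no text layer: rasterize the page and run OCR,
                # with a cheap contrast normalization step first.
                image = page.to_image(resolution=300).original
                image = ImageOps.autocontrast(image.convert("L"))
                text = pytesseract.image_to_string(image)
            texts.append(text)
    return texts
```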

3. Structure Recognition

Raw text isn't enough. You need to understand document structure:

  • Where are the headings versus body text?
  • Which elements are tables, and what are their boundaries?
  • What's in headers and footers versus main content?
  • How do multi-column layouts flow?

This is where many systems fail. They extract text just fine but lose the meaning embedded in the structure. We once spent two weeks debugging an invoice processor only to discover it was reading footer totals as line items.
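
One cheap structural safeguard is to crop away likely header and footer bands before extracting body text. This sketch uses PDFPlumber's bounding-box crop; the 8% margins are placeholders you'd tune per document type:

```python
import pdfplumber

def body_text(page, header_frac: float = 0.08, footer_frac: float = 0.08) -> str:
    """Drop likely header/footer bands and extract only the page body."""
    height, width = float(page.height), float(page.width)
    body = page.within_bbox((0, height * header_frac, width, height * (1 - footer_frac)))
    return body.extract_text() or ""
```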

4. Table Extraction

Tables deserve special attention because they're both common and challenging. The best approach depends on the table type:

  • Bordered tables are easier—you can detect lines and extract cell contents. We get 90%+ accuracy on these.
  • Borderless tables require understanding spacing and alignment. Accuracy drops to 70-80% without custom rules.
  • Complex tables with merged cells and nested structures need specialized handling. Honestly, these still trip us up—we route them to manual review about 40% of the time.

Tools like Camelot or Tabula can handle most cases, but expect to write custom logic for unusual formats. We maintain separate extraction rules for our top 20 vendors because their table formats are so distinctive.
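
Here's what a typical Camelot call looks like, using its built-in parsing report as a first quality gate (the file name and the 80% cutoff are placeholders):

```python
import camelot

# "lattice" detects ruled lines, so it suits bordered tables;
# "stream" infers columns from whitespace for borderless ones.
tables = camelot.read_pdf("invoice.pdf", pages="1-end", flavor="lattice")

usable = []
for table in tables:
    if table.parsing_report["accuracy"] < 80:
        continue  # low-confidence extraction: route to review instead of trusting it
    usable.append(table.df)  # pandas DataFrame of the extracted cells
```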

5. Data Normalization

Extracted data is messy. Dates appear in dozens of formats. Amounts might have currency symbols, commas, or multiple decimal places. Names and addresses vary wildly.

Normalization turns this chaos into consistent, usable data. It's unglamorous work that makes everything downstream easier.

We built a normalization layer that handles 47 different date formats, 12 currency formats, and various address patterns. It still fails on about 2% of documents—usually edge cases like fiscal year dates or multi-currency invoices.
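
To give a flavor of that layer, here are two simplified helpers using python-dateutil for dates and a regex pass for amounts; they're a sketch, not the full rule set:

```python
import re
from datetime import date
from dateutil import parser as dateparser

def normalize_date(raw: str) -> date | None:
    """Best-effort parse across the many date formats we see."""
    try:
        return dateparser.parse(raw, dayfirst=False).date()
    except (ValueError, OverflowError):
        return None  # unparseable: route to review rather than guess

def normalize_amount(raw: str) -> float | None:
    """Strip currency symbols and separators: '$1,234.50' -> 1234.5."""
    # Note: European formats like '1.234,56' need locale-aware handling on top of this.
    cleaned = re.sub(r"[^\d.,-]", "", raw).replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None
```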

Handling Problem Documents

Every large-scale PDF processing system encounters documents that break the rules:

  • Corrupted files that won't open (0.5-1% of batches)
  • Encrypted documents requiring passwords (2-3% for contracts, rare for invoices)
  • Extremely large files that exhaust memory (anything over 100MB needs special handling)
  • Unusual formats your parser doesn't recognize (varies wildly by industry)
  • Poor quality scans where OCR fails (5-10% of scanned documents)

Build explicit error handling for these cases. Log what went wrong, set the document aside for manual review, and keep processing the rest.
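
The skeleton of that pattern is a loop that refuses to die. In this sketch, process_one and review_queue stand in for whatever extractor and review mechanism you use:

```python
import logging

logger = logging.getLogger("pdf_pipeline")

def process_batch(paths, process_one, review_queue):
    """Process what we can; log and park anything that fails."""
    results = []
    for path in paths:
        try:
            results.append(process_one(path))
        except Exception as exc:  # broad on purpose: one bad file must not stop the batch
            logger.warning("failed to process %s: %s", path, exc)
            review_queue.append(path)  # set aside for manual review
    return results
```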

Our error rate dropped from 8% to 2% once we added proper classification upfront. Those 6 percentage points meant 600 fewer documents per 10,000 requiring manual intervention.

Parallel Processing

Processing PDFs is embarrassingly parallel—each document is independent. This is your friend at scale.

Use multiprocessing to work on multiple documents simultaneously. Most systems can process 4-8 PDFs at once before hitting diminishing returns from CPU or memory constraints.

We run 6 parallel workers on standard cloud VMs, which gives us optimal throughput without overwhelming the system. Going to 8 workers improved speed by only 5% while increasing costs 30%.
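
With Python's multiprocessing module this is a few lines; process_document below is a stand-in for your per-document pipeline, and the default worker count mirrors the six we settled on:

```python
from multiprocessing import Pool

def run_parallel(paths, workers: int = 6):
    """Fan documents out across worker processes."""
    # process_document must be a module-level function so it can be pickled
    with Pool(processes=workers) as pool:
        # chunksize=1 keeps one slow 500-page document from blocking a whole chunk
        return pool.map(process_document, paths, chunksize=1)
```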

Just be careful with:

  • API rate limits if you're using external OCR services (we hit Google's rate limit at 10 requests/second)
  • Memory usage with large files (one 500-page contract brought down our entire pipeline once)
  • Database connection pools (we learned to limit connections per worker)

Quality Control

You can't manually review 10,000 documents, but you can implement automated quality checks:

  • Does the extracted text meet minimum length requirements?
  • Are required fields populated?
  • Do extracted numbers pass sanity checks?
  • How confident was the OCR?

Flag low-quality extractions for human review. We review 5% of processed documents randomly, which catches about 80% of systematic errors.

We also track field-level confidence scores. If invoice total extraction confidence is below 85%, it goes to review. This catches edge cases that would otherwise slip through.
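
In code, the review gate is little more than a table of per-field thresholds. The field names and the shape of the fields dict below are illustrative:

```python
REVIEW_THRESHOLDS = {
    "invoice_total": 0.85,  # matches the cutoff mentioned above
    "invoice_date": 0.80,   # illustrative value
}

def needs_review(fields: dict) -> bool:
    """Flag a document if any tracked field falls below its confidence threshold."""
    return any(
        fields[name]["confidence"] < cutoff
        for name, cutoff in REVIEW_THRESHOLDS.items()
        if name in fields
    )
```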

The Document-Specific Approach

Generic solutions struggle with specialized documents. For high-volume document types, build tailored extractors.

We maintain specialized extractors for:

Invoices: Look for standard fields in predictable locations. We've mapped layouts for the top 50 vendors our clients use, which covers 70% of invoice volume. These get 96% extraction accuracy versus 82% for the generic extractor.

Contracts: Focus on key clauses and dates. We use template matching for standard agreements (NDA, MSA, etc.). Custom extraction rules for each contract type improved accuracy from 73% to 91%.

Forms: Detect form fields and extract values. Many forms follow standard layouts. Government forms especially—once you handle one W-9, you can handle most of them.

Invest extraction effort where you have volume. Don't build custom extraction for document types you see once a month.

What Actually Works

After processing millions of PDFs, here's what consistently works:

  1. Start with good documents: Garbage in, garbage out. We helped one client improve their scan quality, which raised extraction accuracy 15% instantly—way more than any algorithm improvement.

  2. Use the right tool for each document type: We maintain 5 different extraction pipelines. It's more complex than one universal solution, but accuracy improved 20% overall.

  3. Implement confidence scoring: Know when to trust automatic extraction and when to escalate. Our confidence scores reduced manual review time by 40%.

  4. Build feedback loops: Track extraction failures and iterate on your rules. We review all flagged documents weekly and update extraction rules monthly.

  5. Cache aggressively: Don't reprocess the same document twice. We cache extraction results and save 15% of processing time on resubmitted documents; a content-hashing sketch follows this list.

  6. Monitor everything: Track processing time, success rates, and error types. We graph these metrics daily and catch regressions fast.
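
A minimal version of that cache keys on a hash of the file contents so renamed resubmissions still hit; the cache directory and the extract callable are placeholders:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("extraction_cache")  # illustrative location

def content_key(path: str) -> str:
    """Key on file contents, not file name, so renamed resubmissions still hit."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def extract_with_cache(path: str, extract):
    cache_file = CACHE_DIR / f"{content_key(path)}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    result = extract(path)  # any extractor returning JSON-serializable data
    CACHE_DIR.mkdir(exist_ok=True)
    cache_file.write_text(json.dumps(result))
    return result
```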

When to Use AI

Modern AI models can help with the hardest problems:

  • Layout understanding using vision models like GPT-4V or Claude with vision
  • Table extraction from complex documents
  • Information extraction from unstructured text
  • Classification of document types

But AI adds cost and complexity. Use it strategically where rule-based approaches fail.

We use AI for about 15% of our documents—the complex cases where traditional extraction struggles. It costs 10x more per document but handles cases that would otherwise need manual review. The economics work when the alternative is human processing at $3-5 per document.
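
The routing logic itself is simple; what matters is the escalation order. In this sketch, rule_based_extract, vision_model_extract, and send_to_manual_review are hypothetical stand-ins:

```python
CONFIDENCE_CUTOFF = 0.85  # same bar we use for invoice totals

def extract(doc):
    """Escalation order: rules first, then an AI model, then a human."""
    result = rule_based_extract(doc)        # hypothetical rule-based extractor
    if result["confidence"] >= CONFIDENCE_CUTOFF:
        return result
    result = vision_model_extract(doc)      # hypothetical vision-model wrapper, ~10x the cost
    if result["confidence"] >= CONFIDENCE_CUTOFF:
        return result
    return send_to_manual_review(doc)       # last resort at $3-5 per document
```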

The Path Forward

Start simple:

  1. Pick one high-value document type
  2. Process a hundred manually to understand patterns
  3. Build extraction rules for the common cases (aim for 70% coverage)
  4. Add error handling for the exceptions
  5. Measure quality on a test set (we use 500 manually-labeled documents)
  6. Scale up gradually

Don't try to handle every edge case on day one. Build a system that works well for 80% of documents, routes 15% to specialists, and fails gracefully on the remaining 5%.

We started with invoices from our client's top 10 vendors. Once that worked at 95% accuracy, we expanded to more vendors incrementally. It took 6 months to reach 90%+ accuracy across all vendors.

Making It Production-Ready

Moving from prototype to production requires:

  • Monitoring: Track success rates, processing times, and error patterns. We alert on any metric that drops 5% week-over-week.
  • Alerting: Get notified when success rates drop. We once caught a vendor format change within hours because our accuracy alerts triggered.
  • Versioning: Track which extraction logic was used for each document. Essential for debugging and reproducing issues.
  • Rollback: Be able to reprocess with updated logic. We reprocess entire batches when we fix major extraction bugs.
  • Audit trails: Know exactly what happened to each document. Required for compliance in many industries.

These operational concerns matter more than extraction accuracy once you're at scale. A 98% accurate system that breaks silently is worse than a 94% accurate system with good monitoring.
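
The week-over-week check can be as simple as comparing the last two data points per metric; the shape of metric_history here is an assumption about how you store the numbers:

```python
def check_for_regressions(metric_history: dict, threshold: float = 0.05) -> list[str]:
    """Alert on any metric that drops more than 5% week over week."""
    alerts = []
    for name, weekly in metric_history.items():  # e.g. {"success_rate": [0.96, 0.90]}
        if len(weekly) < 2 or weekly[-2] <= 0:
            continue
        previous, current = weekly[-2], weekly[-1]
        if (previous - current) / previous > threshold:
            alerts.append(f"{name} dropped from {previous:.2%} to {current:.2%}")
    return alerts
```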

The Bottom Line

PDF parsing at scale is about building resilient systems that handle real-world messiness. Perfect extraction on every document isn't realistic. Fast, consistent, good-enough extraction with smart error handling absolutely is.

We hit 90% accuracy across mixed document types, process 50,000 documents per day, and manually review less than 5%. That's the benchmark to aim for.

Focus on the 80% case, handle errors gracefully, and iterate based on real data. That's what scales.