production · python · ai · evaluation

SmartInvoiceExtractor

Regex-first PDF extraction pipeline with Gemini fallback for semi-structured invoices.

Status: Shipped · Solsten 2025
Role: Software Engineering Intern
Timeline: Jul–Aug 2025
Stack: Python · pdfplumber · Gemini 2.5 Flash · Vertex AI · pytest
fig. 01: architecture diagram

01 Problem

Medical facilities generate large volumes of handwritten and printed receipts and invoices across many document layouts. Manual data entry is slow and inconsistent — a bottleneck for billing, compliance, and downstream analytics.

The extraction problem is semi-structured: the same field (procedure code, invoice total, vendor ID) appears in different positions and formats depending on the document source. A pure LLM approach handles the variation but is too expensive at volume. Pure regex is cheap but brittle on unseen layouts. The goal was a hybrid that keeps LLM calls rare without sacrificing coverage.

02 Core idea

A fast-path / slow-path architecture. pdfplumber parses each PDF into a text layer with layout geometry. A regex layer runs first, targeting high-confidence fields using field-specific pattern sets. Fields that clear a confidence threshold go directly to JSON output.

Fields that score below threshold, or are absent, route to a Gemini 2.5 Flash fallback. A critical-field quality gate adds a document-level check on top of this: if any designated high-priority field is missing from the regex output, the document is escalated to the fallback regardless of overall confidence. LLM output is validated against the field schema before merging.
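The routing rules described here can be sketched as follows. This is a minimal illustration, not the production code: the threshold value, the critical-field names, and the FieldResult, RoutingDecision, and route names are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed values: the source does not state its threshold or critical fields.
CONFIDENCE_THRESHOLD = 0.8
CRITICAL_FIELDS = {"invoice_total", "vendor_id"}


@dataclass
class FieldResult:
    value: Optional[str]
    confidence: float


@dataclass
class RoutingDecision:
    accepted: dict           # fields that go straight to JSON output
    fallback_fields: set     # fields to request from the LLM fallback
    escalate_document: bool  # True when the critical-field gate fires


def route(regex_output: dict, expected_fields: set) -> RoutingDecision:
    accepted = {}
    fallback_fields = set()
    for name in expected_fields:
        result = regex_output.get(name)
        if result is None or result.value is None:
            fallback_fields.add(name)          # absent field
        elif result.confidence < CONFIDENCE_THRESHOLD:
            fallback_fields.add(name)          # below threshold
        else:
            accepted[name] = result.value      # fast path

    # Quality gate: a missing critical field escalates the whole document,
    # however confident the remaining fields are.
    escalate = any(
        regex_output.get(name) is None or regex_output[name].value is None
        for name in CRITICAL_FIELDS
    )
    return RoutingDecision(accepted, fallback_fields, escalate)
```

In this sketch the gate is a separate flag rather than part of the per-field loop, so a document can be escalated even when every extracted field scored well.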

03 pdfplumber parsing

pdfplumber extracts text with positional geometry — character bounding boxes, line groupings, table regions. This layout data lets the regex layer position fields relative to known anchors: totals boxes, header regions, line-item columns.

For documents where text is consistently placed (printed invoices), positional parsing significantly improves extraction precision over raw text dumps. For handwritten or heavily varied layouts, the positional data is less reliable and the fallback does more work.
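As an illustration of anchor-relative positioning, the sketch below finds the words that sit on the same line as a label and to its right. It works on word dictionaries with the text, x0, x1, and top keys that pdfplumber's Page.extract_words() returns. The helper names, the tolerance value, and the "Total" anchor are assumptions for the example, not the production logic.

```python
def words_right_of_anchor(words, anchor_text, y_tolerance=3.0):
    """Return the text on the same line as the anchor word, to its right.

    `words` is a list of dicts with "text", "x0", "x1" and "top" keys.
    Returns None when the anchor is not found or nothing follows it.
    """
    target = anchor_text.lower()
    anchors = [w for w in words if w["text"].lower().rstrip(":") == target]
    if not anchors:
        return None
    anchor = anchors[0]
    same_line = [
        w for w in words
        if abs(w["top"] - anchor["top"]) <= y_tolerance  # same text line
        and w["x0"] >= anchor["x1"]                      # right of the label
    ]
    same_line.sort(key=lambda w: w["x0"])
    return " ".join(w["text"] for w in same_line) or None


def load_words(pdf_path, page_number=0):
    """Read word boxes from one page of a PDF that has a text layer."""
    import pdfplumber  # imported lazily so the helper above has no dependency

    with pdfplumber.open(pdf_path) as pdf:
        return pdf.pages[page_number].extract_words()
```

A call such as words_right_of_anchor(load_words("invoice.pdf"), "total") would then hand a short, well-located string to the regex layer instead of the full page text. The file name is a placeholder.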

04 Regex layer and confidence scoring

The regex layer applies a tiered confidence scorer per field: exact pattern match, partial match with post-processing, and a below-threshold fallback marker. Each field type has its own pattern set — currency-aware patterns for amounts, multi-locale formats for dates, code-registry formats for procedure and diagnosis IDs.

Pattern sets are composable: a field can have multiple patterns targeting different document styles, with the highest-confidence match winning. This allows incremental coverage expansion without modifying the core extractor.
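A minimal sketch of a composable pattern set with tiered confidence might look like this. The two patterns, the confidence scores attached to each tier, and the 0.8 threshold are illustrative assumptions. The source does not publish its actual patterns or scores.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Pattern:
    regex: re.Pattern
    confidence: float  # assumed score for this tier
    post_process: Optional[Callable[[str], str]] = None


# Hypothetical pattern set for an invoice total.
TOTAL_PATTERNS = [
    # Tier 1: exact match on a labelled, fully formatted amount.
    Pattern(re.compile(r"Total\s*:\s*\$(\d{1,3}(?:,\d{3})*\.\d{2})"), 0.95),
    # Tier 2: partial match that needs clean-up afterwards.
    Pattern(
        re.compile(r"(?:Amount|Total)\D{0,15}(\d[\d ,]*\.\d{2})"),
        0.70,
        post_process=lambda s: s.replace(" ", ""),
    ),
]


def extract(text, patterns, threshold=0.8):
    """Try every pattern and keep the highest-confidence match."""
    best = None
    for pattern in patterns:
        match = pattern.regex.search(text)
        if match is None:
            continue
        value = match.group(1)
        if pattern.post_process is not None:
            value = pattern.post_process(value)
        if best is None or pattern.confidence > best[1]:
            best = (value, pattern.confidence)
    if best is None:
        # Tier 3: nothing matched, so mark the field for the fallback.
        return {"value": None, "confidence": 0.0, "needs_fallback": True}
    value, confidence = best
    return {
        "value": value,
        "confidence": confidence,
        "needs_fallback": confidence < threshold,
    }
```

Supporting a new document style means appending a Pattern to the list. The extract function itself does not change, which is the property the paragraph above describes.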

05 JSON normalisation

Regex and Gemini outputs are merged and normalised to a canonical field schema before the record exits the pipeline. Field names are standardised, value formats are coerced (e.g. dates to ISO 8601, amounts to decimal strings), and null vs absent fields are distinguished.

Normalisation ensures that downstream consumers see a consistent schema regardless of which extraction path fired. It also makes schema violations from the Gemini fallback easy to detect: a malformed value that slips past pattern-level validation still fails format coercion during normalisation.
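The coercions named above can be sketched with the standard library. The field names, the accepted input date formats, and the two-decimal quantisation are assumptions for the example. A real deployment would need an explicit policy for ambiguous day/month ordering.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Assumed input formats; day-first is chosen arbitrarily for ambiguous dates.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def to_iso_date(raw):
    """Coerce a date string to ISO 8601 (YYYY-MM-DD) or raise ValueError."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {raw!r}")


def to_decimal_string(raw):
    """Coerce an amount to a two-decimal string or raise ValueError."""
    cleaned = raw.replace(",", "").replace("$", "").strip()
    try:
        return str(Decimal(cleaned).quantize(Decimal("0.01")))
    except InvalidOperation:
        raise ValueError(f"unrecognised amount: {raw!r}") from None


# Hypothetical canonical schema: field name -> coercion function.
COERCERS = {
    "invoice_date": to_iso_date,
    "invoice_total": to_decimal_string,
    "vendor_id": str.strip,
}


def normalise(record):
    """Apply the canonical schema to a merged regex/LLM record."""
    out = {}
    for name, coerce in COERCERS.items():
        if name not in record:
            continue  # absent: no extraction path produced this field
        value = record[name]
        # null: the field was looked for but no value was found
        out[name] = None if value is None else coerce(value)
    return out
```

Because every coercion raises ValueError on input it cannot parse, a malformed fallback value such as "12O.00" (letter O instead of zero) fails here instead of reaching downstream consumers.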

06 Regex suggestion loop

When the Gemini fallback fires for a field that should have been extractable by regex, the pipeline logs the miss: the document text segment, the field type, and the recovered value. A post-run analysis step groups misses by field type and document format, surfacing candidate regex patterns that would have matched.

This loop turns fallback events into pattern improvement signals. High-fallback-rate field types surface as improvement targets. Pattern candidates can be reviewed and merged into the regex layer — gradually extending fast-path coverage without a manual audit of raw documents.
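One way the post-run analysis could work is sketched below: misses are grouped by field type and document format, and a rough candidate pattern is derived from each logged segment. The miss-record keys and the digit-generalising heuristic are assumptions. The source does not describe how candidates are actually generated.

```python
import re
from collections import defaultdict
from typing import Optional


def rank_targets(misses):
    """Group misses by (field_type, doc_format), most frequent first."""
    groups = defaultdict(list)
    for miss in misses:
        groups[(miss["field_type"], miss["doc_format"])].append(miss)
    return sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)


def candidate_pattern(segment: str, value: str) -> Optional[str]:
    """Build a naive regex from the label text preceding the value.

    Returns None when the recovered value does not appear verbatim in the
    segment, for example because the fallback returned a reformatted value.
    """
    idx = segment.find(value)
    if idx == -1:
        return None
    label = segment[:idx].strip()
    # Generalise digit runs so the pattern matches other values too.
    value_shape = re.sub(r"\d+", lambda m: r"\d+", re.escape(value))
    return re.escape(label) + r"\s*(" + value_shape + ")"
```

Candidates produced this way are deliberately crude and are meant for human review before merging, in line with the review step described above.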

07 Tradeoffs

Regex is brittle on layouts the patterns haven't seen. Any document format outside the existing pattern set will under-extract and push to the LLM fallback. The suggestion loop addresses this over time, but cold-start coverage for a new document type requires initial pattern development.

The Gemini fallback uses structured output constraints to reduce hallucination risk, but semantic errors — a plausible-looking but incorrect code — require ground-truth comparison to detect. Field-level validation catches type and format violations; it can't catch logically incorrect values.
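The gap between format validation and semantic correctness can be shown in a few lines. The five-digit code format and the sample values are hypothetical, chosen only to illustrate the point.

```python
import re

# Assumed format for the example: a five-digit procedure code.
CODE_FORMAT = re.compile(r"\d{5}")


def is_well_formed(code):
    """Format check only: says nothing about whether the code is correct."""
    return CODE_FORMAT.fullmatch(code) is not None


def field_accuracy(predictions, ground_truth):
    """Fraction of ground-truth fields whose predicted value matches exactly."""
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    correct = sum(
        1 for name, expected in ground_truth.items()
        if predictions.get(name) == expected
    )
    return correct / len(ground_truth)
```

A prediction of "99214" where the labelled value is "99213" passes is_well_formed but lowers field_accuracy, which is why semantic errors only surface against labelled data.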

pdfplumber assumes text is embedded in the PDF. Scanned documents and image-only PDFs without an OCR layer produce sparse or empty text extractions. The pipeline does not include OCR — it requires digitally generated or OCR-preprocessed inputs.
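A simple pre-ingestion guard for this case could look like the sketch below. It is an assumption rather than a documented part of the pipeline: the function name and the characters-per-page threshold are hypothetical, and the input is one extracted string per page, for example from pdfplumber's Page.extract_text().

```python
def has_text_layer(page_texts, min_chars_per_page=20):
    """Heuristic check that a PDF carries an embedded text layer.

    `page_texts` holds one string (or None) per page. Returns False for
    empty documents and for documents whose average extracted text per
    page falls below the threshold.
    """
    if not page_texts:
        return False
    total_chars = sum(len((text or "").strip()) for text in page_texts)
    return total_chars / len(page_texts) >= min_chars_per_page
```

Documents that fail the check would be sent to a separate OCR preprocessing step instead of the regex layer.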

08 Limitations

Public repo vs. production deployment. The public repo is an earlier local prototype — it demonstrates the core regex-first / Gemini-fallback architecture and JSON output structure. The production deployment (Solsten internship, 2025) extended this with the pdfplumber integration, quality gate, normalisation layer, and suggestion loop. Production metrics should not be attributed to the public prototype.

OCR is out of scope. Image-only PDFs or scans without an embedded text layer are not handled. A separate OCR preprocessing step is required before ingestion.

Schema coverage. The field schema and pattern sets are tailored to the document formats encountered at Solsten. Adapting to a new domain requires schema extension and pattern development.

09 Outcomes

The following metrics are from the proprietary Solsten production deployment and should not be attributed to the public prototype:

≈98.2% field-level accuracy across document types. ≈65% lower inference cost versus a pure-LLM baseline (fallback call rate held to ~35% of documents). ≈80% reduction in manual correction workload.
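One reading of how the fallback rate and the cost figure relate is sketched below. It assumes that a fallback call costs the same as a baseline LLM call and that regex cost is negligible. Neither assumption is stated in the source, so this is arithmetic on the reported numbers, not a reproduction of the production cost model.

```python
def relative_inference_cost(fallback_rate, llm_cost_per_doc=1.0,
                            regex_cost_per_doc=0.0):
    """Hybrid cost per document as a fraction of the pure-LLM baseline."""
    hybrid_cost = regex_cost_per_doc + fallback_rate * llm_cost_per_doc
    return hybrid_cost / llm_cost_per_doc


# With a 35% fallback rate the hybrid costs 35% of the baseline,
# which corresponds to a saving of about 65%.
saving = 1.0 - relative_inference_cost(0.35)
```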

10 What I learned

Key insight

The interesting work in production extraction pipelines is usually the quality gate and the evaluation loop, not the model call. Knowing when to route to the LLM — and what to learn from each fallback event — matters more than the fallback itself.

Fast-path / slow-path is the correct architecture for cost-sensitive LLM pipelines. The hybrid is not a compromise; it is a deliberate design choice with explicit cost and quality implications at each stage.