Financial Services

Financial Document Processing Platform

Hours to seconds (~99% faster)

Avg. processing time

<3% (80% reduction)

Extraction error rate

30x increase

Monthly throughput

$7 to $0.03

Cost per document

The Challenge

A mid-market financial services firm was drowning in manual document processing. Analysts spent hours extracting account numbers, coverage limits, and rate tables from PDFs, spreadsheets, Word docs, and scanned images. Off-the-shelf OCR tools failed on complex tables and inconsistent field labels, putting onboarding SLAs at risk.

The Solution

Phase 1: Core Extraction Engine (Weeks 1-4)

Built a serverless document processing API on AWS with a testable service layer
Parsed OCR block graphs to reconstruct key/value pairs across pages
Normalized repeated fields and preserved positional metadata for auditability

from typing import Any, Dict

def _extract_form_fields(self, organized_blocks: Dict[str, Dict]) -> Dict[str, Any]:
    """Pair each label block with its linked value and collect the results by label text."""
    form_fields: Dict[str, Any] = {}
    for label_block in organized_blocks["labels"].values():
        label_text = self._resolve_block_text(label_block, organized_blocks["text"])
        if not label_text:
            continue  # skip label blocks with no readable text
        value_text = self._find_linked_value(label_block, organized_blocks["values"], organized_blocks["text"])
        if value_text:
            # Attach metadata from the label block, then merge so repeated labels are handled in one place.
            field_meta = self._build_field_metadata(value_text.strip(), label_block)
            self._merge_field_occurrence(form_fields, label_text.strip(), field_meta)
    return form_fields
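
The `organized_blocks` structure consumed above is not shown in the source. As a rough sketch, assuming the OCR output follows the Amazon Textract `Blocks` shape (where `KEY_VALUE_SET` blocks carry an `EntityTypes` of `KEY` or `VALUE` and point to other blocks through `Relationships`), the grouping and value lookup could look like this. The names `organize_blocks`, `related_ids`, `block_text`, and `linked_value` are hypothetical stand-ins, not the project's actual helpers, and the sample blocks are hand-written rather than real OCR output.

```python
from typing import Any, Dict, List

Block = Dict[str, Any]


def organize_blocks(blocks: List[Block]) -> Dict[str, Dict[str, Block]]:
    """Group Textract-style blocks by role, keyed by block Id (hypothetical helper)."""
    organized: Dict[str, Dict[str, Block]] = {"labels": {}, "values": {}, "text": {}}
    for block in blocks:
        if block["BlockType"] == "KEY_VALUE_SET":
            role = "labels" if "KEY" in block.get("EntityTypes", []) else "values"
            organized[role][block["Id"]] = block
        elif block["BlockType"] in ("WORD", "SELECTION_ELEMENT"):
            organized["text"][block["Id"]] = block
    return organized


def related_ids(block: Block, relation: str) -> List[str]:
    """Return the Ids a block points to through one relationship type."""
    ids: List[str] = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == relation:
            ids.extend(rel["Ids"])
    return ids


def block_text(block: Block, text_blocks: Dict[str, Block]) -> str:
    """Join the text of a block's CHILD blocks into a single string."""
    words = [text_blocks[i]["Text"] for i in related_ids(block, "CHILD")
             if i in text_blocks and "Text" in text_blocks[i]]
    return " ".join(words)


def linked_value(label: Block, organized: Dict[str, Dict[str, Block]]) -> str:
    """Follow the label's VALUE relationship and resolve the value text."""
    for value_id in related_ids(label, "VALUE"):
        value_block = organized["values"].get(value_id)
        if value_block is not None:
            return block_text(value_block, organized["text"])
    return ""


# Tiny hand-written sample in the Textract block shape (not real OCR output).
sample_blocks = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]}, {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Account"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Number:"},
    {"Id": "w3", "BlockType": "WORD", "Text": "12345"},
]

organized = organize_blocks(sample_blocks)
label = organized["labels"]["k1"]
print(block_text(label, organized["text"]), "->", linked_value(label, organized))
```

Keying every group by block Id keeps each relationship lookup a dictionary access, which matters when a multi-page document produces thousands of blocks.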

Phase 2: Office Document Support (Weeks 5-6)

Added a headless LibreOffice Lambda Layer to convert Excel/Word to PDF
Standardized the pipeline regardless of source format

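import os
import subprocess
import tempfile

# OFFICE_BINARY and ConversionError are defined elsewhere in the module.
@staticmethod
def convert_to_pdf(document_bytes: bytes, source_format: str) -> bytes:
    # TemporaryDirectory removes the working files on exit, so /tmp does not fill up across warm invocations.
    with tempfile.TemporaryDirectory() as work_dir:
        input_path = os.path.join(work_dir, f"source.{source_format}")
        with open(input_path, "wb") as f:
            f.write(document_bytes)
        try:
            result = subprocess.run(
                [OFFICE_BINARY, "--headless", "--invisible", "--convert-to", "pdf", "--outdir", work_dir, input_path],
                capture_output=True,
                timeout=90,
                env={**os.environ, "HOME": "/tmp"},  # LibreOffice needs a writable HOME for its profile
            )
        except subprocess.TimeoutExpired as exc:
            raise ConversionError("Conversion timed out after 90 seconds") from exc
        output_path = os.path.join(work_dir, "source.pdf")
        # A zero exit code does not guarantee output was written, so check for the file as well.
        if result.returncode != 0 or not os.path.exists(output_path):
            raise ConversionError(f"Failed: {result.stderr.decode(errors='replace')}")
        with open(output_path, "rb") as f:
            return f.read()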
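
One way to keep the downstream pipeline format-agnostic is to decide up front which uploads need the Office-to-PDF step. The sketch below is an assumption about how that routing might look; the extension sets and the `source_format` and `needs_pdf_conversion` names are illustrative and not taken from the project.

```python
import os

# Assumed extension sets; the formats actually supported by the project may differ.
OFFICE_EXTENSIONS = {"doc", "docx", "xls", "xlsx"}
NATIVE_EXTENSIONS = {"pdf", "png", "jpg", "jpeg", "tiff"}


def source_format(filename: str) -> str:
    """Return the lowercase extension without the leading dot."""
    return os.path.splitext(filename)[1].lstrip(".").lower()


def needs_pdf_conversion(filename: str) -> bool:
    """True when the file should go through the Office-to-PDF step first."""
    fmt = source_format(filename)
    if fmt in OFFICE_EXTENSIONS:
        return True
    if fmt in NATIVE_EXTENSIONS:
        return False
    raise ValueError(f"Unsupported document format: {fmt or 'unknown'}")


print(needs_pdf_conversion("rates.XLSX"))
print(needs_pdf_conversion("policy.pdf"))
```

Rejecting unknown extensions early avoids spending a conversion attempt (and its 90-second timeout budget) on files the pipeline cannot handle.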

Phase 3: Performance Optimization (Weeks 7-8)

Content-addressable cache in DynamoDB, keyed by the SHA-256 of the document plus query params
Hybrid sync/async flow: small docs are processed synchronously; large docs return a job ID and a poll endpoint to avoid timeouts
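
A content-addressable key of the kind described in the first item can be derived from the document bytes plus a canonical form of the query parameters. This is a minimal sketch: the `cache_key` function and the in-memory `FakeCache` (standing in for the DynamoDB table and its TTL attribute) are hypothetical, and the real key layout is not given in the source.

```python
import hashlib
import json
import time
from typing import Any, Dict, Optional


def cache_key(document_bytes: bytes, query_params: Dict[str, Any]) -> str:
    """SHA-256 over the document digest and a canonical JSON form of the params."""
    # Sorting keys makes the key independent of parameter order.
    canonical_params = json.dumps(query_params, sort_keys=True, separators=(",", ":"))
    # Hash the document first so the combined input has a fixed-length prefix.
    document_digest = hashlib.sha256(document_bytes).hexdigest()
    combined = f"{document_digest}:{canonical_params}".encode("utf-8")
    return hashlib.sha256(combined).hexdigest()


class FakeCache:
    """In-memory stand-in for a DynamoDB table with a TTL attribute (illustration only)."""

    def __init__(self, ttl_seconds: int = 3600) -> None:
        self._items: Dict[str, Dict[str, Any]] = {}
        self._ttl_seconds = ttl_seconds

    def put(self, key: str, result: Dict[str, Any]) -> None:
        self._items[key] = {"result": result, "expires_at": time.time() + self._ttl_seconds}

    def get(self, key: str) -> Optional[Dict[str, Any]]:
        item = self._items.get(key)
        # DynamoDB TTL deletes expired items lazily, so expiry is checked on read too.
        if item is None or item["expires_at"] <= time.time():
            return None
        return item["result"]


cache = FakeCache()
key = cache_key(b"%PDF-1.7 sample", {"fields": ["Account Number"], "tables": True})
cache.put(key, {"Account Number": "12345"})

# Same bytes and the same params in a different order resolve to the same entry.
same_key = cache_key(b"%PDF-1.7 sample", {"tables": True, "fields": ["Account Number"]})
print(cache.get(same_key))
```

Including the query parameters in the key means the same document requested with different field selections is cached separately, which avoids returning a result computed for another query.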

if self._requires_async_processing(document_data):
    job_id = ocr_service.start_async_job(document_data)
    return success_response({"job_id": job_id, "status": "PROCESSING", "poll_endpoint": "/job-status"})
ocr_result = ocr_service.analyze_document(document_data)
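
The snippet above relies on `_requires_async_processing`, whose criteria are not shown. A plausible version, assuming the decision is based on payload size and page count, is sketched here together with a simple client-side polling loop. The thresholds, the `fetch_status` callable, and the `SUCCEEDED` status name are all assumptions made for illustration.

```python
import time
from typing import Any, Callable, Dict

# Assumed limits; tune to the real API timeout and OCR service quotas.
MAX_SYNC_BYTES = 5 * 1024 * 1024
MAX_SYNC_PAGES = 1


def requires_async_processing(size_bytes: int, page_count: int) -> bool:
    """Route large or multi-page documents to the async path."""
    return size_bytes > MAX_SYNC_BYTES or page_count > MAX_SYNC_PAGES


def poll_until_done(fetch_status: Callable[[str], Dict[str, Any]], job_id: str,
                    interval_seconds: float = 2.0, max_attempts: int = 30) -> Dict[str, Any]:
    """Call fetch_status(job_id) until the job leaves PROCESSING or attempts run out."""
    for _ in range(max_attempts):
        status = fetch_status(job_id)
        if status["status"] != "PROCESSING":
            return status
        time.sleep(interval_seconds)
    raise TimeoutError(f"Job {job_id} still processing after {max_attempts} polls")


# Fake status endpoint that finishes on the third poll.
_responses = iter([
    {"status": "PROCESSING"},
    {"status": "PROCESSING"},
    {"status": "SUCCEEDED", "fields": {"Account Number": "12345"}},
])
final = poll_until_done(lambda job_id: next(_responses), "job-123", interval_seconds=0.0)
print(final["status"])
```

Capping the number of polls keeps a stuck job from holding a client open indefinitely; the caller can surface the `TimeoutError` and retry later with the same job ID.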

Phase 4: Fuzzy Field Matching (Weeks 9-10)

Tolerant field matching (case-insensitive, partial matches) across forms and tables

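def _is_field_match(self, candidate: str, target: str) -> bool:
    candidate_normalized = candidate.lower().strip()
    target_normalized = target.lower().strip()
    if not target_normalized:
        return False  # an empty target is a substring of every candidate
    # Exact match, or the target appears anywhere inside the candidate label.
    return candidate_normalized == target_normalized or target_normalized in candidate_normalized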
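
To make the matching behavior concrete, here is a standalone version of the rule (lowercase, trim, then substring match, with a guard against empty targets) applied to a few label variants. The `find_field` helper and the sample labels are made up for illustration. Substring matching can also produce false positives, as the last case shows.

```python
from typing import Dict, Optional


def is_field_match(candidate: str, target: str) -> bool:
    """Case-insensitive match: exact, or target contained in candidate."""
    candidate_normalized = candidate.lower().strip()
    target_normalized = target.lower().strip()
    if not target_normalized:
        return False  # an empty target would otherwise match everything
    # The substring test also covers the exact-match case.
    return target_normalized in candidate_normalized


def find_field(fields: Dict[str, str], target: str) -> Optional[str]:
    """Return the value of the first extracted label that matches target (hypothetical helper)."""
    for label, value in fields.items():
        if is_field_match(label, target):
            return value
    return None


# Made-up labels showing casing, whitespace, and suffix variations.
extracted = {
    "  ACCOUNT NUMBER: ": "12345",
    "Coverage Limit (USD)": "1,000,000",
    "Discount Rate": "5%",
}

print(find_field(extracted, "account number"))
print(find_field(extracted, "Coverage Limit"))
print(find_field(extracted, "Rate"))
# False positive: "count" is a substring of "discount".
print(is_field_match("Discount Rate", "count"))
```

If false positives like the last one become a problem, matching on whole words instead of raw substrings is one possible tightening, at the cost of missing abbreviated labels.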

Technical Decisions

Decision	Rationale
Serverless over containers	Burst traffic; pay-per-invocation economics
IaC with SAM	Fast iteration during POC; native AWS integrations
DynamoDB over Redis	TTL cleanup, pay-per-request, zero ops
LibreOffice layer	Reliable headless conversion for Office formats
Hybrid sync/async	Sub-3s for small docs; async polling for large to avoid timeouts

Outcome

Avg. processing time: 3-5 hours → 8 seconds (~99% reduction)
Error rate: ~15% → <3% (80% reduction)
Monthly throughput: ~400 docs → ~12,000 docs (30x increase)
Cost per document: ~$7 labor → ~$0.03 compute (99% reduction)
Adopted by 2 additional business units within 90 days; cache hit rate stabilized at 28%
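
The percentages above follow from the before/after figures. The short calculation below reproduces them, assuming the midpoint of the 3-5 hour range (4 hours) as the baseline and 3% as the upper bound for the new error rate; both choices are assumptions made only for this check.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from before to after, in percent."""
    return (before - after) / before * 100.0


# 4 hours (assumed midpoint of 3-5 hours) down to 8 seconds.
time_reduction = percent_reduction(4 * 3600, 8)
# 15% error rate down to 3% (upper bound of "<3%").
error_reduction = percent_reduction(15.0, 3.0)
# $7 labor down to $0.03 compute per document.
cost_reduction = percent_reduction(7.00, 0.03)
# ~400 docs per month up to ~12,000.
throughput_multiple = 12_000 / 400

print(f"time: {time_reduction:.2f}%")
print(f"errors: {error_reduction:.0f}%")
print(f"cost: {cost_reduction:.2f}%")
print(f"throughput: {throughput_multiple:.0f}x")
```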

Stack

Compute: AWS Lambda (Python 3.11), Lambda Layers
AI/ML: Amazon Textract (Forms + Tables)
Storage: S3, DynamoDB
API and IaC: API Gateway, CloudFormation/SAM
Patterns: Service layer architecture, dependency injection, async job polling, content-addressable caching