A mid-market financial services firm was drowning in manual document processing. Analysts spent hours extracting account numbers, coverage limits, and rate tables from PDFs, spreadsheets, Word docs, and scanned images. Off-the-shelf OCR tools failed on complex tables and inconsistent field labels, putting onboarding SLAs at risk.
The Solution
Phase 1: Core Extraction Engine (Weeks 1-4)
Built a serverless document processing API on AWS with a testable service layer
Parsed OCR block graphs to reconstruct key/value pairs across pages
Normalized repeated fields and preserved positional metadata for auditability
from typing import Any, Dict

def _extract_form_fields(self, organized_blocks: Dict[str, Dict]) -> Dict[str, Any]:
    """Pair each label block with its linked value and collect the results."""
    form_fields: Dict[str, Any] = {}
    for label_block in organized_blocks["labels"].values():
        label_text = self._resolve_block_text(label_block, organized_blocks["text"])
        if not label_text:
            continue  # skip labels with no readable text
        value_text = self._find_linked_value(
            label_block, organized_blocks["values"], organized_blocks["text"]
        )
        if value_text:
            # Keep positional metadata alongside the value so results can be audited.
            field_meta = self._build_field_metadata(value_text.strip(), label_block)
            # Repeated labels are merged rather than overwritten.
            self._merge_field_occurrence(form_fields, label_text.strip(), field_meta)
    return form_fields
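The excerpt above calls _merge_field_occurrence without showing it. As a rough illustration of how repeated fields might be normalized while keeping every occurrence for audit, here is a minimal standalone sketch. The label normalization rule and the shape of the metadata (a "value" key plus a "page" key) are assumptions made for this example, not the project's actual implementation.

```python
from typing import Any, Dict


def merge_field_occurrence(
    form_fields: Dict[str, Any], label: str, field_meta: Dict[str, Any]
) -> None:
    """Record one occurrence of a field under a normalized label (hypothetical helper)."""
    # Collapse whitespace, lowercase, and drop a trailing colon so that
    # "Account Number:" and "account  number" map to the same key.
    key = " ".join(label.lower().split()).rstrip(":").strip()
    # The first value seen becomes the primary value; every occurrence is kept.
    entry = form_fields.setdefault(
        key, {"value": field_meta["value"], "occurrences": []}
    )
    entry["occurrences"].append(field_meta)


fields: Dict[str, Any] = {}
merge_field_occurrence(fields, "Account Number:", {"value": "12345", "page": 1})
merge_field_occurrence(fields, "account  number", {"value": "12345", "page": 3})
```

After both calls, fields holds a single "account number" entry whose occurrences list records the metadata from pages 1 and 3, so a reviewer can trace each value back to where it appeared.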
Phase 2: Office Document Support (Weeks 5-6)
Added a headless LibreOffice Lambda Layer to convert Excel/Word to PDF
Standardized the pipeline regardless of source format
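import os
import subprocess
import tempfile

# OFFICE_BINARY and ConversionError are defined elsewhere in the module.
@staticmethod
def convert_to_pdf(document_bytes: bytes, source_format: str) -> bytes:
    # TemporaryDirectory removes its contents on exit, so warm containers do not fill /tmp.
    with tempfile.TemporaryDirectory() as work_dir:
        input_path = os.path.join(work_dir, f"source.{source_format}")
        with open(input_path, "wb") as f:
            f.write(document_bytes)
        result = subprocess.run(
            [OFFICE_BINARY, "--headless", "--invisible", "--convert-to", "pdf",
             "--outdir", work_dir, input_path],
            capture_output=True,
            timeout=90,
            # LibreOffice needs a writable HOME; on Lambda, /tmp is writable.
            env={**os.environ, "HOME": "/tmp"},
        )
        if result.returncode != 0:
            # stderr is bytes because capture_output is used without text=True.
            raise ConversionError(f"Failed: {result.stderr.decode(errors='replace')}")
        with open(os.path.join(work_dir, "source.pdf"), "rb") as f:
            return f.read()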
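To show what "regardless of source format" could look like at the entry point, here is a small dispatch sketch. The function name normalize_to_pdf, the OFFICE_FORMATS set, and the injected converter argument are all assumptions for illustration; the source names only Excel and Word as converted formats, and image inputs are left out of this sketch.

```python
from typing import Callable

# Assumed extensions for "Excel/Word"; adjust to whatever the pipeline accepts.
OFFICE_FORMATS = {"xlsx", "xls", "docx", "doc"}


def normalize_to_pdf(
    document_bytes: bytes,
    source_format: str,
    converter: Callable[[bytes, str], bytes],
) -> bytes:
    """Return PDF bytes, converting Office documents and passing PDFs through."""
    fmt = source_format.lower().lstrip(".")
    if fmt == "pdf":
        return document_bytes  # already in the pipeline's standard format
    if fmt in OFFICE_FORMATS:
        return converter(document_bytes, fmt)
    raise ValueError(f"Unsupported source format: {source_format!r}")
```

Passing the converter in as an argument keeps the dispatch logic testable without a LibreOffice binary; in the real service it would presumably be the convert_to_pdf function shown above.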
Phase 3: Performance Optimization (Weeks 7-8)
Added a content-addressable cache in DynamoDB, keyed by the SHA-256 of the document plus its query parameters
Introduced a hybrid sync/async flow: small documents are processed synchronously, while large documents return a job ID and a polling endpoint to avoid timeouts
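The two optimizations above can be sketched in a few lines. This is an illustrative example only: the function names, the JSON canonicalization of query parameters, and the 1 MiB size cutoff are assumptions, since the source does not state how parameters are serialized or what counts as a small document. The DynamoDB read and write around the key are omitted.

```python
import hashlib
import json
from typing import Any, Dict

# Assumed cutoff between synchronous and asynchronous handling.
SYNC_SIZE_LIMIT_BYTES = 1 * 1024 * 1024


def cache_key(document_bytes: bytes, query_params: Dict[str, Any]) -> str:
    """SHA-256 over the document bytes plus canonicalized query parameters."""
    digest = hashlib.sha256()
    digest.update(document_bytes)
    # json.dumps never emits a raw NUL byte, so the last NUL in the hashed
    # stream always marks the document/parameters boundary.
    digest.update(b"\x00")
    # sort_keys makes the key independent of parameter ordering.
    canonical = json.dumps(query_params, sort_keys=True, separators=(",", ":"))
    digest.update(canonical.encode("utf-8"))
    return digest.hexdigest()


def choose_mode(document_bytes: bytes) -> str:
    """Pick 'sync' for small documents and 'async' for large ones."""
    return "sync" if len(document_bytes) <= SYNC_SIZE_LIMIT_BYTES else "async"
```

Because the key depends only on content and parameters, re-uploading the same file with the same query hits the cache even if the filename changes, while changing any parameter produces a different key.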