Stop paying humans to retype PDFs into your ERP.
We build production document extraction pipelines that pull structured data from invoices, contracts, claims, and forms — then post it directly into NetSuite, SAP, Epic, Salesforce, or your DMS with confidence-scored human review on the exceptions.
Most document AI projects stall at 80% accuracy and never make it to production.
The demo looks great. A vendor drops a PDF into a chat window, an LLM returns clean JSON, everyone nods. Then reality hits: scanned faxes from a 1998 multifunction printer, contracts with rider amendments stapled on, invoices in seven currencies, claims forms where the patient wrote outside the box. The pilot accuracy that hit 95% on cherry-picked samples drops to 78% on the actual mailroom feed. Nobody trusts the output. Data entry staff keep keying everything in parallel "just to be safe." Six months in, the project is shelf-ware and the AP team is still using the temp agency.
- ▸ Pilots tested on clean samples that don't reflect the messy reality of scanned, faxed, or photographed documents.
- ▸ No per-field confidence scoring, so a 90% accurate model gets treated as 100% trustworthy or 0% trustworthy.
- ▸ Extracted JSON dumped into a CSV instead of posted directly into NetSuite, SAP, Epic, or the DMS that actually matters.
- ▸ No human-in-the-loop UI, so corrections happen in spreadsheets and never feed back into improving the model.
Treat document AI like an OCR pipeline with a language model bolted on — not magic.
- STEP-01
Sample the document corpus first
Before writing a line of code, we pull 200-500 real documents across vendors, formats, and edge cases. Scanned PDFs, photos taken at angles, multi-page contracts, forms with handwriting. The corpus determines whether you need Textract, Azure Document Intelligence, or a vision LLM — not the other way around.
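As a sketch, a stratified pull over a hypothetical directory layout (one folder per vendor; in practice the strata usually come from your DMS metadata) might look like:

import random
from collections import defaultdict
from pathlib import Path

def sample_corpus(root: Path, per_stratum: int = 25) -> list[Path]:
    """Fixed-size sample per (vendor, format) bucket so edge cases aren't drowned out."""
    strata: dict[tuple[str, str], list[Path]] = defaultdict(list)
    for doc in root.rglob("*"):
        if doc.is_file() and doc.suffix.lower() in {".pdf", ".tif", ".jpg", ".png"}:
            strata[(doc.parent.name, doc.suffix.lower())].append(doc)  # assumes one folder per vendor
    sample: list[Path] = []
    for bucket in strata.values():
        sample.extend(random.sample(bucket, min(per_stratum, len(bucket))))
    return sample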
- STEP-02
Extract with a typed schema
Every document type gets a Pydantic or Zod schema. The LLM returns structured JSON validated against that schema, with field-level confidence scores. No free-text outputs into your ERP. If the model can't produce schema-valid output after two retries, the document routes to human review automatically.
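A minimal sketch of that retry-then-escalate step, reusing the Invoice schema and extract() function from the code example further down this page (review_queue is a placeholder for whatever exception queue you actually run):

from pydantic import ValidationError

MAX_ATTEMPTS = 3  # one initial call plus two retries

def extract_or_escalate(pdf_bytes: bytes, review_queue) -> Invoice | None:
    """Validate against the schema; after two failed retries, escalate to a human."""
    for _ in range(MAX_ATTEMPTS):
        try:
            # extract() raises pydantic.ValidationError on schema-invalid output
            return extract(pdf_bytes)
        except ValidationError:
            continue
    review_queue.put(pdf_bytes)  # placeholder: SQS, Celery, or a DB-backed queue
    return None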
- STEP-03
Set per-field confidence thresholds
Invoice total at 99.5% confidence auto-posts. Vendor name at 95% auto-posts. Anything below routes to a reviewer queue with the source document highlighted. We tune thresholds per field based on the cost of a wrong value — a misread PO number is cheaper to fix than a misread payment amount.
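The routing logic itself is small; the threshold values below are illustrative, since the real ones are tuned from measured error costs per field:

# Illustrative per-field auto-post thresholds; tune from the cost of a wrong value.
FIELD_THRESHOLDS = {
    "total": 0.995,          # a misread payment amount is the expensive failure
    "invoice_number": 0.99,
    "vendor_name": 0.95,
}
DEFAULT_THRESHOLD = 0.97     # applied to any field without an explicit threshold

def route(invoice: Invoice) -> tuple[bool, list[str]]:
    """Return (auto_post, fields_for_review) from the model's per-field confidence."""
    flagged = [
        field for field, conf in invoice.field_confidence.items()
        if conf < FIELD_THRESHOLDS.get(field, DEFAULT_THRESHOLD)
    ]
    return (not flagged, flagged)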
- STEP-04
Wire into the system of record
Extracted data lands in NetSuite, SAP, Epic, Salesforce, or your DMS through their actual APIs — not CSV drops. We handle idempotency, duplicate detection, and rollback. Every posted record carries a link back to the source document and the extraction run that produced it.
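As a sketch of the idempotent write, here is a generic REST post with a deterministic idempotency key derived from the source document and extraction run (the /vendor-bills route, header name, and payload are placeholders; the real call shape depends on the target system's API):

import hashlib
import requests

def post_vendor_bill(invoice: Invoice, pdf_sha256: str, run_id: str,
                     base_url: str, token: str) -> None:
    """Post one bill; the same document and run always produce the same key."""
    idem_key = hashlib.sha256(f"{pdf_sha256}:{run_id}".encode()).hexdigest()
    resp = requests.post(
        f"{base_url}/vendor-bills",  # placeholder route
        headers={"Authorization": f"Bearer {token}", "Idempotency-Key": idem_key},
        json={
            "vendor": invoice.vendor_name,
            "invoice_number": invoice.invoice_number,
            "currency": invoice.currency,
            "total": invoice.total,
            "source_document": pdf_sha256,  # link back to the source PDF
            "extraction_run": run_id,       # and the run that produced it
        },
        timeout=30,
    )
    resp.raise_for_status()  # retries reuse the same key, so they can't double-post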
- STEP-05
Close the loop with reviewer corrections
When a human corrects a field, that correction becomes labeled training data. We track per-field accuracy weekly and retrain prompts, few-shot examples, or fine-tunes when drift appears. After 90 days most pipelines need 60-80% less human review than at launch.
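One way to capture those corrections as labeled data is an append-only JSONL log keyed by extraction run (a database table works just as well); weekly per-field accuracy is then just the share of rows where the prediction matches the label:

import json
import time

def record_correction(log_path: str, run_id: str, field: str,
                      predicted: str, corrected: str, model_conf: float) -> None:
    """Append one reviewer correction as a labeled training example."""
    row = {
        "run_id": run_id,
        "field": field,
        "predicted": predicted,        # what the model extracted
        "label": corrected,            # what the reviewer keyed in: ground truth
        "model_confidence": model_conf,
        "ts": time.time(),
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(row) + "\n")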
from pydantic import BaseModel, Field
from typing import Literal
import anthropic
import base64


class InvoiceLineItem(BaseModel):
    description: str
    quantity: float
    unit_price: float
    total: float
    confidence: float = Field(ge=0, le=1)


class Invoice(BaseModel):
    vendor_name: str
    vendor_tax_id: str | None
    invoice_number: str
    invoice_date: str  # ISO 8601
    due_date: str | None
    currency: Literal["USD", "EUR", "GBP", "CAD"]
    subtotal: float
    tax: float
    total: float
    line_items: list[InvoiceLineItem]
    field_confidence: dict[str, float]  # per-field confidence, 0 to 1


def extract(pdf_bytes: bytes) -> Invoice:
    client = anthropic.Anthropic()
    resp = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=4096,
        # Forcing a tool call constrains the output to the Invoice JSON schema.
        tools=[{
            "name": "submit_invoice",
            "description": "Record the extracted invoice fields.",
            "input_schema": Invoice.model_json_schema(),
        }],
        tool_choice={"type": "tool", "name": "submit_invoice"},
        messages=[{"role": "user", "content": [
            # The API expects base64-encoded document data, not raw bytes.
            {"type": "document", "source": {
                "type": "base64",
                "media_type": "application/pdf",
                "data": base64.b64encode(pdf_bytes).decode("ascii"),
            }},
            {"type": "text", "text": "Extract all fields. Set confidence below 0.9 for anything ambiguous."},
        ]}],
    )
    # With a forced tool choice, the first content block is the tool call.
    return Invoice.model_validate(resp.content[0].input)

Forcing the model into a typed tool call eliminates hallucinated fields and makes downstream ERP integration deterministic.
Field FAQ.
→ What accuracy can we realistically expect on invoice or contract extraction?
For clean digital PDFs of standardized forms — invoices, W-9s, ACORD certificates — we typically hit 97-99% field-level accuracy after tuning. Scanned documents, handwriting, or non-standard contracts run 88-95%. The honest answer is that accuracy is a per-field number, not a per-document number. Vendor name might be 99.5% while line-item descriptions sit at 92%. We measure each field separately and set thresholds accordingly.
→ Do we still need humans in the loop, or does AI replace the data entry team?
You still need humans, but for exceptions only. A well-tuned pipeline auto-processes 70-85% of documents end-to-end after the first few months. The remaining 15-30% route to a reviewer UI showing the source document with low-confidence fields highlighted. Reviewers correct, approve, and move on — a job that takes seconds instead of minutes per document. Headcount usually shifts from data entry to exception handling and audit.
→ How does this integrate with our existing DMS, ERP, or claims system?
Through whatever interface the system actually supports — REST APIs for NetSuite, SAP, Workday, Salesforce, Epic; SOAP for older systems; SFTP drops for legacy mainframe gateways. We've also pushed into SharePoint, Box, iManage, and OpenText DMS platforms. The extraction service is decoupled from the system of record, so swapping the destination later doesn't require rebuilding the pipeline. Idempotency keys prevent duplicate postings during retries.
→ Which AI model do you use — Claude, GPT, or something else?
It depends on the document. Claude Sonnet 4.5 and GPT-4.1 both handle vision well and are roughly comparable on structured extraction. For high-volume commodity documents we sometimes use Azure Document Intelligence or AWS Textract as a first pass, then an LLM only for the fields the OCR service can't handle. For sensitive federal work we deploy on AWS GovCloud or Azure Government using models with appropriate authorizations.
→ Can this run in a FedRAMP or IL4/IL5 environment for federal contracts?
Yes. As an SDVOSB we deploy in AWS GovCloud, Azure Government, and on-prem when required. Bedrock and Azure OpenAI both offer models in FedRAMP High boundaries. For IL5 workloads we typically use Azure Government Secret or self-hosted open-weight models like Llama or Mistral on accredited infrastructure. We've supported document workflows under DFARS 252.204-7012 and CMMC requirements.
→ How long does a typical pilot take from kickoff to production?
A focused pilot on one document type runs 4-8 weeks. Weeks 1-2 cover corpus collection and schema design. Weeks 3-5 cover pipeline build, prompt tuning, and the reviewer UI. Weeks 6-8 cover integration testing with the target system and shadow-mode validation against your current process. We don't cut over until extraction accuracy and integration reliability are measured on real volume, not synthetic test data.
→ What does this cost to run per document at scale?
API costs for vision LLMs run roughly $0.005-$0.05 per page depending on model and document length. At 100,000 documents per month, assuming roughly one page each, that's $500-$5,000 in inference. Add infrastructure, monitoring, and storage and you're typically at $0.02-$0.10 fully loaded per document. Compare that to $1-$5 per document for offshore manual entry, and the payback on the build is usually under a year.
→ How do you handle PII, PHI, or contract confidentiality?
Documents never train foundation models — we use API endpoints with zero-retention agreements (Anthropic, OpenAI, Bedrock, Azure OpenAI all offer this). For PHI we operate under signed BAAs. Documents at rest are encrypted with customer-managed KMS keys, and we support VPC-only deployments where the LLM call goes through a private endpoint. Audit logs capture every extraction, every reviewer action, and every system-of-record write.
→ What happens when the model is wrong on something important?
Three things. First, confidence thresholds catch most errors before they post — low-confidence fields go to human review. Second, every posted record links back to the source document and extraction run, so corrections are one click. Third, we run weekly accuracy audits sampling production output against ground truth. When a field's accuracy drifts, we get an alert and tune before it becomes a business problem.
Continue recon.
- REL-01
AI Integration Services
How we embed Claude, GPT, and RAG into operational workflows that have to actually work.
- REL-02
Document AI Case Studies
Real extraction pipelines we've shipped, with accuracy numbers and integration details.
- REL-03
Pilot Packages
Fixed-scope 4-8 week pilots on a single document type, with production-ready output.
- REL-04
Scope a Pilot
Send us a sample of your documents and target system. We'll tell you what's feasible.
Have a stack of PDFs nobody wants to key in? Let's scope a pilot.
Talk to a VooStack operator. We respond within one business day.