Extraction Workflows & Best Practices

Best practices for the document extraction workflow: upload, detection, schema review, and extraction.

Upload step

Group similar documents into a single workflow — same supplier, same document type, same approximate layout. Schemas trained on one batch carry over automatically to subsequent batches from the same source. The more consistent your input documents, the less prompt-tuning each new batch requires.

If you are processing invoices from multiple suppliers in a single session, expect to review and adjust the schema for each distinct layout. The classifier proposes a schema based on the first document it analyses. Suppliers with non-standard layouts may need a few extra fields added or auto-detected fields removed before extraction produces good results.

For automated ingestion via the API, organise document uploads by customer or supplier ID. This makes it easier to trace extraction results back to the source and apply per-customer schema overrides when needed.
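
As a minimal sketch of that organisation, the snippet below groups files into per-supplier batches before upload. The folder layout (inbox/<supplier ID>/<file>) and the function name group_by_supplier are assumptions made for illustration; they are not part of the product's API.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_supplier(paths):
    """Group document file names by the supplier ID in the parent folder name."""
    batches = defaultdict(list)
    for raw in paths:
        path = PurePosixPath(raw)
        supplier_id = path.parent.name  # assumption: the folder is named after the supplier ID
        batches[supplier_id].append(path.name)
    return dict(batches)


uploads = [
    "inbox/SUP-001/inv-1001.pdf",
    "inbox/SUP-002/inv-77.pdf",
    "inbox/SUP-001/inv-1002.pdf",
]
batches = group_by_supplier(uploads)
for supplier_id, files in batches.items():
    print(supplier_id, files)
```

Uploading one batch per supplier keeps each extraction result traceable to its source and gives a natural place to attach a per-customer schema override.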

Detection step

AI detection runs the classifier on the uploaded document and proposes a schema. The classifier is trained on a wide range of document types and is accurate for standard invoice, PO, and receipt layouts. It is less reliable for unusual formats — handwritten documents, mixed-language documents, or documents with non-standard column arrangements.

Always review the proposed schema before accepting. Check that field types are correct — a date field should have type date, not text — that required fields are marked required, and that field descriptions are specific enough to distinguish similar fields such as invoice_date versus due_date.
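
The checks above can be partly automated before a human review. The sketch below assumes a schema is available as a list of dictionaries with name, type, required, and description keys; that shape, the type names, and the five-word threshold are all hypothetical.

```python
ALLOWED_TYPES = {"text", "number", "date", "boolean"}  # hypothetical type names


def review_schema(fields, min_description_words=5):
    """Return a list of human-readable issues found in a proposed schema."""
    issues = []
    for field in fields:
        name = field["name"]
        field_type = field.get("type")
        if field_type not in ALLOWED_TYPES:
            issues.append(f"{name}: unknown type {field_type!r}")
        # Naming heuristic only: fields ending in _date should be typed as date.
        if name.endswith("_date") and field_type != "date":
            issues.append(f"{name}: looks like a date but has type {field_type!r}")
        if len(field.get("description", "").split()) < min_description_words:
            issues.append(f"{name}: description is too short to be specific")
    return issues


proposed = [
    {"name": "invoice_date", "type": "text", "required": True, "description": "Date"},
    {
        "name": "due_date",
        "type": "date",
        "required": False,
        "description": "Payment deadline, usually labelled 'Due date' or 'Payment due'",
    },
]
issues = review_schema(proposed)
for issue in issues:
    print(issue)
```

A linter like this only catches mechanical problems. Whether a description really distinguishes invoice_date from due_date still needs a human read.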

If the classifier proposes fields you do not need, remove them before running extraction. Every field in the schema triggers a separate extraction call — unused fields slow down extraction and add noise to the review step without providing value.
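
If the schema is handled programmatically, pruning can be a simple filter. This is a sketch under the assumption that a schema is a list of dictionaries with a name key; the helper prune_schema is hypothetical.

```python
def prune_schema(fields, needed):
    """Keep only the fields whose names are in `needed`; report what was dropped."""
    kept = [f for f in fields if f["name"] in needed]
    dropped = [f["name"] for f in fields if f["name"] not in needed]
    return kept, dropped


proposed = [
    {"name": "invoice_number"},
    {"name": "total"},
    {"name": "fax_number"},
    {"name": "vendor_name"},
]
kept, dropped = prune_schema(proposed, needed={"invoice_number", "total", "vendor_name"})
print("extraction calls saved per document:", len(dropped))
```

Because each field triggers its own extraction call, every dropped field is one fewer call per document.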

Edit step

The Review Schema panel is where you finalise the extraction blueprint for a document type. Drag fields to reorder them — the order you set here is the order they appear in the Review & Edit panel after extraction, so put the most important fields (invoice number, total, vendor name) at the top where reviewers will see them first.

Mark fields as required if they must be present for the extraction to be usable downstream. Required fields that come back empty are flagged as errors rather than warnings, making them easy to spot during review.
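
The same rule can be mirrored in downstream code that consumes extraction results. The sketch below assumes empty optional fields count as warnings and that results arrive as a plain name-to-value mapping; both are assumptions, and flag_empty_fields is a hypothetical helper.

```python
def flag_empty_fields(schema, values):
    """Map each empty field to 'error' (required) or 'warning' (optional)."""
    flags = {}
    for field in schema:
        value = values.get(field["name"])
        if value is None or str(value).strip() == "":
            flags[field["name"]] = "error" if field.get("required") else "warning"
    return flags


schema = [
    {"name": "invoice_number", "required": True},
    {"name": "total", "required": True},
    {"name": "po_reference", "required": False},
]
values = {"invoice_number": "INV-1001", "total": "", "po_reference": None}
flags = flag_empty_fields(schema, values)
print(flags)
```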

Write field descriptions carefully at this stage — it is the highest-leverage activity in the whole workflow. A precise description that explains where the value appears and how it is typically labelled will produce accurate extraction without needing a correction loop. A vague description will require multiple improvement iterations to converge.
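
To make the difference concrete, here is an illustrative pair of descriptions for the same field. The dictionary shape and the wording are examples only, not output from the product.

```python
# A vague description: gives the model nothing to distinguish this date from others.
vague = {"name": "invoice_date", "type": "date", "description": "The date."}

# A precise description: says where the value appears, how it is labelled,
# and which similar field it must not be confused with.
precise = {
    "name": "invoice_date",
    "type": "date",
    "description": (
        "Date the invoice was issued, printed in the header block near the "
        "invoice number and usually labelled 'Invoice date' or 'Date of issue'. "
        "Not the payment due date."
    ),
}

print(len(vague["description"].split()), "words vs", len(precise["description"].split()), "words")
```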

Extract step

Extraction runs each field independently. Fields with simple, well-labelled values (invoice number, vendor name) return in a few seconds. Fields that require the model to reason across the full document (line items in a multi-page invoice) take longer. Results appear as they complete, so you can start reviewing fast fields while the slower ones are still processing.
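
The "results appear as they complete" behaviour can be pictured with a small simulation. This is not the product's implementation: extract_field is a stand-in that sleeps instead of calling a model, and the timings are invented.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def extract_field(name, seconds):
    """Stand-in for one per-field extraction call."""
    time.sleep(seconds)
    return name, f"<value of {name}>"


# Invented durations: simple fields are fast, multi-page line items are slow.
simulated = {"invoice_number": 0.01, "vendor_name": 0.02, "line_items": 0.2}

arrival_order = []
with ThreadPoolExecutor(max_workers=len(simulated)) as pool:
    futures = [pool.submit(extract_field, n, s) for n, s in simulated.items()]
    # as_completed yields each future as soon as it finishes,
    # so fast fields can be shown before slow ones are done.
    for future in as_completed(futures):
        name, value = future.result()
        arrival_order.append(name)
        print("ready for review:", name)
```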

Bounding boxes are detected in a separate fire-and-forget call that runs after extraction completes. The UI stays responsive during this step. If you open the Review & Edit panel immediately after extraction, some bounding boxes may still be loading — wait a few seconds and they will populate.

Before saving corrections, review every field value against the highlighted region on the PDF. Do not rely solely on the extracted text — the bounding box shows you where the model looked, which can reveal misidentification even when the extracted text looks plausible at first glance.

Batch processing tips

For batch processing — running the same schema against hundreds of documents — the most important thing is front-loading the schema quality work. One well-tuned schema with 95 percent or higher accuracy on a small pilot batch will produce consistent results at scale. Pushing a schema to batch processing at 80 percent accuracy creates a large manual correction backlog that slows the whole pipeline.
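
One way to decide whether a pilot batch clears the bar is to compare extracted values with saved ground truth. The sketch below assumes accuracy means field-level exact match across the pilot documents; the text above does not define the metric, so treat that as an assumption.

```python
READY_THRESHOLD = 0.95  # the 95 percent bar mentioned above


def field_accuracy(extracted, ground_truth):
    """Share of ground-truth fields whose extracted value matches exactly."""
    total = correct = 0
    for doc_id, truth in ground_truth.items():
        for field, expected in truth.items():
            total += 1
            if extracted.get(doc_id, {}).get(field) == expected:
                correct += 1
    return correct / total if total else 0.0


ground_truth = {
    "doc-1": {"invoice_number": "INV-1", "total": "100.00"},
    "doc-2": {"invoice_number": "INV-2", "total": "250.00"},
}
extracted = {
    "doc-1": {"invoice_number": "INV-1", "total": "100.00"},
    "doc-2": {"invoice_number": "INV-2", "total": "25O.00"},  # OCR-style error
}
accuracy = field_accuracy(extracted, ground_truth)
ready_for_batch = accuracy >= READY_THRESHOLD
print(f"accuracy={accuracy:.2f} ready={ready_for_batch}")
```

Here three of four fields match, so the schema stays in the pilot stage for another improvement cycle.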

Use the API to upload and trigger extractions programmatically. Subscribe to the extraction.completed webhook to receive results as each document finishes instead of polling, and write the webhook payload straight into your ERP or data store to minimise latency.
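
A webhook consumer can be as small as the handler sketched below. Only the event name extraction.completed comes from this guide; the payload keys (type, document_id, fields) are hypothetical, so check the actual payload before relying on them. The in-memory dictionary stands in for an ERP or data store.

```python
import json


def handle_webhook(body, store):
    """Store the extracted fields from an extraction.completed event.

    Returns True if the event was handled, False if it was ignored.
    """
    event = json.loads(body)
    if event.get("type") != "extraction.completed":
        return False  # ignore other event types
    # Hypothetical payload keys: document_id and fields.
    store[event["document_id"]] = event["fields"]
    return True


store = {}
body = json.dumps(
    {
        "type": "extraction.completed",
        "document_id": "doc-42",
        "fields": {"invoice_number": "INV-1001", "total": "100.00"},
    }
)
handled = handle_webhook(body, store)
print(handled, store)
```

Keeping the handler a pure function of the request body makes it easy to test without running a web server.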

For high-volume processing, spread uploads evenly across time to stay within API rate limits. The extraction endpoint allows 10 requests per minute on Pro plans. For larger volumes, contact us about Enterprise throughput limits.
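
Spreading requests evenly comes down to a fixed interval between calls: at 10 requests per minute, one request every 6 seconds. The sketch below computes send offsets only; the function name is hypothetical and the actual upload call is left out.

```python
def upload_schedule(n_documents, requests_per_minute=10):
    """Return the offset in seconds at which each request should be sent."""
    interval = 60.0 / requests_per_minute
    return [i * interval for i in range(n_documents)]


offsets = upload_schedule(4)
print(offsets)

# Time at which the last of 25 documents is submitted on a Pro plan.
last_offset = upload_schedule(25)[-1]
print(f"last request sent after {last_offset:.0f} s")
```

A sender would sleep until each offset before issuing the next request, which keeps any 60-second window at or below the limit.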

Scanned vs digital PDFs

Born-digital PDFs (documents generated by accounting software, ERP systems, or PDF printers) produce the best extraction results. The text is machine-readable at the PDF layer, so the model can work with it directly, with no OCR step in between.

Scanned documents require an OCR pass before extraction. The quality of OCR output depends heavily on scan quality. For scanned invoices, aim for a minimum resolution of 200 DPI, with 300 DPI preferred. Low-resolution or skewed scans introduce OCR noise that causes extraction errors independent of the model or prompt quality.
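
Scan resolution can be estimated from the image's pixel dimensions and the physical page size. The sketch below applies the 200 and 300 DPI thresholds from the paragraph above; the function names and the "rescan / acceptable / preferred" labels are illustrative.

```python
def effective_dpi(pixel_width, pixel_height, page_width_in, page_height_in):
    """Estimate scan resolution as the lower of the horizontal and vertical DPI."""
    return min(pixel_width / page_width_in, pixel_height / page_height_in)


def scan_quality(dpi):
    """Classify a resolution against the 200 DPI minimum and 300 DPI preference."""
    if dpi < 200:
        return "rescan"
    if dpi < 300:
        return "acceptable"
    return "preferred"


# US Letter pages are 8.5 x 11 inches.
for width, height in [(1275, 1650), (1700, 2200), (2550, 3300)]:
    dpi = effective_dpi(width, height, 8.5, 11)
    print(f"{width}x{height}px -> {dpi:.0f} DPI -> {scan_quality(dpi)}")
```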

If you are seeing consistent errors on specific fields in scanned documents — date fields with misread digits, or amounts where the digit 0 is read as the letter O — check the source scan resolution before investing time in additional improvement cycles. Improving the scan quality often fixes the error faster than further prompt tuning.
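
A quick heuristic can separate OCR noise from prompt problems: a value that is numeric apart from letters commonly misread for digits points at the scan. The character set (O, o, l, I) and the helper name below are assumptions for illustration, not an exhaustive list.

```python
import re


def suspect_ocr_noise(value):
    """True if a numeric-looking value contains letters often misread for digits."""
    looks_numeric = re.fullmatch(r"[\d.,OolI]+", value) is not None
    has_digit = re.search(r"\d", value) is not None
    has_confusable = re.search(r"[OolI]", value) is not None
    return looks_numeric and has_digit and has_confusable


for value in ["1,2O0.50", "1,200.50", "Oslo"]:
    print(value, suspect_ocr_noise(value))
```

If many values in a batch trip this check, the source scans are the more likely culprit than the field prompts.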

When to re-extract vs improve

Two buttons are available after saving ground truth: Re-extract and Improve. Knowing which to use saves time.

Use Re-extract when you have manually edited field prompts in the Schema panel and want to test the updated instructions without running the full learning loop. Re-extract applies the current prompts as-is. It is a fast test run that typically takes about as long as the original extraction.

Use Improve when you have saved corrections and want the GEPA loop to automatically rewrite the prompts to fix those errors. Improve is slower than Re-extract because it runs a prompt-refinement step before re-extracting, but it produces better prompts because it analyses the error pattern rather than relying on manual edits.

For a new document type, the typical sequence is: extract, review, correct, save ground truth, Improve. For a known document type where you suspect a prompt regression, the sequence is: re-extract, review, and if still wrong, save ground truth and Improve.
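
The two sequences can be written down as a small helper, for example to drive a checklist. The function name and parameters are hypothetical; the steps themselves are the ones listed above.

```python
def planned_steps(is_new_document_type, still_wrong_after_reextract=False):
    """Return the ordered workflow steps for a new or a known document type."""
    if is_new_document_type:
        return ["extract", "review", "correct", "save ground truth", "improve"]
    steps = ["re-extract", "review"]
    if still_wrong_after_reextract:
        # Re-extract did not resolve the suspected regression: fall back to Improve.
        steps += ["save ground truth", "improve"]
    return steps


print(planned_steps(is_new_document_type=True))
print(planned_steps(is_new_document_type=False))
print(planned_steps(is_new_document_type=False, still_wrong_after_reextract=True))
```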