Troubleshooting Document Extraction

Common issues and fixes for document extraction accuracy, ERP matching confidence, and performance.

Low extraction accuracy

Run the Improve step to invoke the GEPA learning loop. It rewrites the per-field prompts based on the corrections you saved as ground truth. Two or three improvement iterations typically push accuracy above 95 percent on consistent document layouts.

If accuracy remains low after three iterations, the cause is usually one of three things. First, the field description is too vague: the model cannot reliably locate the right value because the description does not distinguish it from similar fields on the page. Rewrite the description to be more specific about where the value appears and what it looks like. Second, the layout varies significantly between documents in the same batch. If suppliers use different column arrangements for the same field, create a separate schema for each layout variant. Third, the source document quality is low. See the PDF quality issues section for how scan resolution affects accuracy.

If a specific field is consistently wrong while others are correct, use the conversational copilot to diagnose it. Describe what went wrong and ask for a revised prompt suggestion.

Matching confidence too low

Low matching confidence means the extracted text does not closely match any record in your ERP catalog. The most common causes and their fixes are:

Catalog not synced. If you have added new products to your ERP recently, sync the catalog from the Integrations page. Matching against a stale catalog will miss records that have been added since the last sync.

SKU format differences. The extracted SKU might be formatted as WIDGET-500-BL while the ERP catalog has WIDGET500BL. Add an alias for the variant in the matching config to bridge the gap without modifying the ERP record.
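To decide which aliases are worth adding, it can help to list the extracted SKUs that differ from a catalog SKU only in formatting. The following is a minimal sketch, assuming that separators and letter case are the only differences. The function names are hypothetical and are not part of the product or of its matching config.

```python
import re

def normalize_sku(sku: str) -> str:
    """Uppercase the SKU and drop every character that is not a letter or digit."""
    return re.sub(r"[^A-Za-z0-9]", "", sku).upper()

def find_alias_candidates(extracted_skus, catalog_skus):
    """Map each extracted SKU to a catalog SKU that differs only in formatting."""
    by_normalized = {normalize_sku(s): s for s in catalog_skus}
    candidates = {}
    for sku in extracted_skus:
        match = by_normalized.get(normalize_sku(sku))
        # Skip SKUs with no counterpart and SKUs that already match exactly.
        if match is not None and match != sku:
            candidates[sku] = match
    return candidates

# Illustrative data only.
catalog = ["WIDGET500BL", "GASKET-12"]
extracted = ["WIDGET-500-BL", "gasket 12", "BOLT-9"]
print(find_alias_candidates(extracted, catalog))
```

Each pair in the result is a candidate alias. SKUs with no formatting-only counterpart, such as BOLT-9 here, point to a different cause, for example a catalog that has not been synced.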

Fuzzy match threshold set too high. The default confidence threshold for automatic acceptance is 0.85. If matches that look obviously correct are being flagged for manual review, lower the threshold to 0.75 and check the results on a sample batch.
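When reviewing a sample batch, the matches that matter are the ones whose outcome changes between 0.85 and 0.75. The sketch below shows one way to list them from exported confidence scores. It assumes a match is accepted when its confidence is greater than or equal to the threshold; the data layout and function names are hypothetical.

```python
def split_by_threshold(matches, threshold):
    """Split (label, confidence) pairs into auto-accepted and manual-review lists."""
    accepted = [m for m in matches if m[1] >= threshold]
    review = [m for m in matches if m[1] < threshold]
    return accepted, review

def newly_accepted(matches, old_threshold, new_threshold):
    """Matches that go to review at the old threshold but pass at the new one."""
    return [m for m in matches if new_threshold <= m[1] < old_threshold]

# Illustrative confidence scores only.
sample_batch = [("line 1", 0.92), ("line 2", 0.81), ("line 3", 0.77), ("line 4", 0.60)]

for threshold in (0.85, 0.75):
    accepted, review = split_by_threshold(sample_batch, threshold)
    print(threshold, len(accepted), "accepted,", len(review), "for review")

print(newly_accepted(sample_batch, 0.85, 0.75))
```

If every newly accepted match in the sample is correct, the lower threshold is probably safe for that document type. If some are wrong, keep the higher threshold and address the cause another way.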

Product descriptions that are too generic. If multiple catalog entries have similar descriptions, the matcher may assign low confidence because it cannot pick the right one. Add distinguishing attributes — size, colour, material code — to the catalog descriptions to sharpen the match.

Slow extraction

Extraction time scales with the number of fields in the schema and the number of pages in the document. A ten-field schema on a two-page invoice typically completes in 20 to 30 seconds. A thirty-field schema on a ten-page document may take two to three minutes.

To reduce extraction time, trim fields from the schema that you do not actively use downstream. Every field triggers a separate extraction call — unused fields add latency without adding value. If you added exploratory fields early in the workflow and no longer review them, remove them from the schema.
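As a rough illustration of the trimming step, the sketch below compares the fields in a schema against the fields your downstream process actually reads. The field names and the list-based schema representation are assumptions made for the example, not the product's schema format.

```python
def trim_schema(schema_fields, used_downstream):
    """Return (kept, removed) field lists, preserving the schema's order."""
    used = set(used_downstream)
    kept = [f for f in schema_fields if f in used]
    removed = [f for f in schema_fields if f not in used]
    return kept, removed

# Hypothetical field names.
schema = ["invoice_number", "invoice_date", "total", "vendor_name", "po_reference", "notes"]
used = ["invoice_number", "total", "vendor_name"]

kept, removed = trim_schema(schema, used)
print("keep:", kept)
print("remove:", removed)
# One extraction call per field, so each removed field saves one call.
print("extraction calls saved per document:", len(removed))
```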

Line item extraction is the most time-intensive step. If the document contains a large table with many rows, line item extraction can take 30 to 60 seconds on its own. If you only need the header fields — invoice number, total, vendor name — disable line item extraction for that schema and retrieve only the header data.

PDF quality issues

Some extraction errors are caused by PDF quality rather than schema or prompt issues. Signs of a quality problem: the bounding box highlights a completely wrong region of the document, numeric fields consistently contain OCR misreads (0 read as O, 1 read as l), or the classifier incorrectly identifies the document type.
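To check a batch of extracted values for this kind of misread, you can scan numeric fields for letters that resemble digits. This is a minimal sketch, not the product's validation logic. The confusion table covers the two examples above plus lowercase "o" and uppercase "I", which are added here as assumptions.

```python
# Letters that resemble digits, mapped to the digit they were probably meant to be.
OCR_CONFUSIONS = {"O": "0", "o": "0", "l": "1", "I": "1"}

def suspect_characters(value: str):
    """Return (position, character) pairs for letters that look like misread digits."""
    return [(i, ch) for i, ch in enumerate(value) if ch in OCR_CONFUSIONS]

def suggest_correction(value: str) -> str:
    """Replace each suspect letter with the digit it resembles."""
    return "".join(OCR_CONFUSIONS.get(ch, ch) for ch in value)

# Only apply this to fields that are expected to be numeric.
extracted_total = "1O5l.50"
print(suspect_characters(extracted_total))
print(suggest_correction(extracted_total))
```

If many numeric fields across a batch contain suspect characters, the scan quality is the more likely cause, and prompt tuning alone is unlikely to fix it.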

For scanned documents, the minimum recommended scan resolution is 200 DPI. At lower resolutions, OCR errors become frequent enough to affect extraction quality even after extensive prompt tuning. If possible, request higher-quality scans from the document source — the accuracy gain from a better scan is faster to achieve than the gain from additional improvement cycles.
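If a scan does not report its resolution, you can estimate it from the image's pixel width and the physical page width, since DPI is pixels divided by inches. The sketch below applies that arithmetic against the 200 DPI minimum. The function names are hypothetical, and the page width is something you supply, for example 8.5 inches for US Letter.

```python
MIN_RECOMMENDED_DPI = 200  # minimum recommended scan resolution from this guide

def effective_dpi(pixel_width: int, page_width_inches: float) -> float:
    """Estimate scan resolution as pixels per inch across the page width."""
    return pixel_width / page_width_inches

def meets_minimum(pixel_width: int, page_width_inches: float,
                  minimum: float = MIN_RECOMMENDED_DPI) -> bool:
    """True when the estimated resolution reaches the recommended minimum."""
    return effective_dpi(pixel_width, page_width_inches) >= minimum

# A US Letter page (8.5 inches wide) scanned at two different pixel widths.
for width_px in (1700, 1275):
    dpi = effective_dpi(width_px, 8.5)
    print(width_px, "px wide:", dpi, "DPI, ok" if meets_minimum(width_px, 8.5) else "DPI, too low")
```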

For born-digital PDFs that show unexpected errors, check whether the PDF has text extraction restrictions or is encrypted. Some PDF generators produce documents in which the visual content is a bitmap layer rather than selectable text. These are treated as scanned images and require the OCR path. You can verify this by trying to select text in the PDF with a standard viewer. If text cannot be selected, the document will be processed as a scan.

If a specific document consistently produces errors despite good scan quality, contact support with the dataset ID. We can inspect the raw OCR output to determine whether the problem is at the OCR layer or the extraction layer.

Schema conflicts

A schema conflict occurs when two fields in the same schema target similar or overlapping regions of the document, causing the model to confuse them. Common examples: invoice_date and due_date both returning the same date, or subtotal and total both extracting the same amount.
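One way to spot likely conflicts across a batch is to look for pairs of fields that return identical values in most documents. The sketch below assumes you have the extracted results as one dictionary per document. That layout, the function name, and the 0.8 cut-off are all assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

def find_conflicting_pairs(results, min_share=0.8):
    """Return field pairs whose values are identical in at least min_share
    of the documents where both fields have a value."""
    same = Counter()
    both = Counter()
    for doc in results:
        for a, b in combinations(sorted(doc), 2):
            if doc[a] is None or doc[b] is None:
                continue
            both[(a, b)] += 1
            if doc[a] == doc[b]:
                same[(a, b)] += 1
    return [pair for pair in sorted(both) if same[pair] / both[pair] >= min_share]

# Illustrative results only.
results = [
    {"invoice_date": "2024-03-01", "due_date": "2024-03-01", "subtotal": "100.00", "total": "120.00"},
    {"invoice_date": "2024-03-05", "due_date": "2024-03-05", "subtotal": "250.00", "total": "300.00"},
    {"invoice_date": "2024-03-09", "due_date": "2024-03-09", "subtotal": "80.00", "total": "96.00"},
]
print(find_conflicting_pairs(results))
```

A flagged pair is a prompt to compare the two field descriptions, not proof of a conflict. Some pairs are legitimately equal, such as subtotal and total on an invoice with no tax.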

To fix a conflict, open the Schema panel and review the descriptions of the conflicting fields side by side. Make each description explicitly exclude the other field, for example: "The invoice issue date, not the due date or the service period start date." and "The invoice total including all taxes and fees, not the subtotal, net amount, or any individual line item total."

The conversational copilot can help resolve conflicts quickly. Describe the confusion — "invoice_date and due_date are both returning the same value" — and ask it to suggest distinguishing descriptions for both fields. The copilot will analyse the typical layout patterns for those field types and generate descriptions that separate them.

Contacting support

If you have worked through the troubleshooting steps above and are still seeing issues, contact us at contact@dataextractor.io.

Include the following in your message: the dataset ID (visible in the URL of the extraction session), the field or fields that are incorrect, a description of what the extracted value is versus what it should be, and the document type and supplier name if relevant. Screenshots of the Review & Edit panel with the incorrect field highlighted are also helpful.

For issues affecting multiple documents or an entire workflow, include a sample dataset ID from each affected document type. This allows us to inspect the extraction traces and identify whether the root cause is in the OCR layer, the extraction model, or the prompt configuration.

Enterprise customers have access to a dedicated support channel with a four-hour response SLA for production issues.