Getting Started with Document Extraction

Quick start guide to extracting structured data from PDFs, invoices, and purchase orders with dataextractor.io.

Before You Start

dataextractor.io works with any born-digital PDF or scanned image that contains structured data: invoices, purchase orders, receipts, statements, customs declarations, or any other document where you need specific values extracted into fields.

No training data is required before your first extraction. The schema detection step automatically identifies what fields are likely to exist in the document. You review and approve the proposed schema before extraction runs, so you are always in control of what gets extracted.

The Free plan allows 10 documents per month with the full extraction pipeline. No credit card is required to sign up. If you are evaluating the platform for a team, run your first batch of documents through the same account and validate extraction quality on your own document types first; you can wire up API access afterwards.

Sign up and log in

Create a free account at dataextractor.io. The Free plan gives you 10 document extractions per month and access to the full extraction pipeline — schema detection, per-field extraction, bounding box visualisation, the GEPA learning loop, and ERP matching.

No credit card is required. Sign up with an email and password, or continue with Google. Once logged in, you land on the Extractor page — the main interface for uploading documents and managing extraction sessions.

If your organisation uses SSO or you need multiple team members to share a workspace, contact us at contact@dataextractor.io to discuss Enterprise options.

Upload your first document

From the Extractor page, click Upload Document and select a file from your file system. Supported formats are PDF (born-digital and scanned), JPEG, and PNG. Files up to 25 MB are accepted. Multi-page documents are supported, and all pages are processed as a single extraction session.

Alternatively, paste a URL to a publicly accessible PDF in the URL field. This is useful for testing with sample documents or for automating ingestion from document management systems that expose public links.

For best results on your first extraction, use a representative document from your most common supplier or workflow. The schema detection step will propose fields based on this document, and those fields are reused for subsequent documents of the same type.
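If you plan to upload many files, it can help to check them against the format and size limits above before uploading. The following is a minimal sketch, not part of dataextractor.io: the helper name `check_upload` is hypothetical, and it assumes "25 MB" means 25 × 1024 × 1024 bytes.

```python
from pathlib import Path

# Limits taken from the text above; the helper itself is hypothetical.
SUPPORTED_EXTENSIONS = {".pdf", ".jpg", ".jpeg", ".png"}
MAX_SIZE_BYTES = 25 * 1024 * 1024  # assumption: 25 MB interpreted as 25 MiB


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    suffix = Path(filename).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {suffix or '(none)'}")
    if size_bytes > MAX_SIZE_BYTES:
        problems.append(f"file is {size_bytes} bytes, over the 25 MB limit")
    return problems


if __name__ == "__main__":
    print(check_upload("invoice.PDF", 120_000))
    print(check_upload("notes.docx", 30 * 1024 * 1024))
```

The check only looks at the file extension and size. It does not confirm that the file contents are a valid PDF or image.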

AI detects the schema

After upload, the classifier analyses the document layout and identifies the document type. For an invoice, it proposes fields like vendor name, invoice number, invoice date, due date, line items, subtotal, tax amount, and total. For a purchase order, it proposes PO number, buyer entity, supplier entity, delivery address, and line item fields.

The proposed schema is displayed on the Review Schema panel. Each field has a name, type (text, number, currency, date, or line_item), and a description. Review the suggested fields carefully: approve the ones that look correct, edit any that need renaming or type changes, and remove fields you do not need.

You can also add fields that the classifier did not detect. Click Add Field and describe what you want — for example, 'payment terms' or 'VAT registration number'. The more descriptive you make the field name and description, the more accurately the extraction prompt will locate the right value.
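To make the shape of a schema concrete, here is an illustrative sketch of how the fields described above (name, type, description) could be modelled in your own code. The `SchemaField` class and the field names are assumptions for illustration; they are not the platform's internal representation or export format.

```python
from dataclasses import dataclass

# Field types listed for the Review Schema panel in the text above.
FIELD_TYPES = {"text", "number", "currency", "date", "line_item"}


@dataclass(frozen=True)
class SchemaField:
    """Hypothetical local model of one schema field."""
    name: str
    type: str
    description: str

    def __post_init__(self):
        # Reject types that the guide does not list.
        if self.type not in FIELD_TYPES:
            raise ValueError(f"unknown field type: {self.type!r}")


# Illustrative invoice schema; descriptive wording helps locate the right value.
invoice_schema = [
    SchemaField("vendor_name", "text", "Legal name of the supplier issuing the invoice"),
    SchemaField("invoice_date", "date", "Date the invoice was issued, not the due date"),
    SchemaField("total", "currency", "Grand total including tax"),
    SchemaField("payment_terms", "text", "Payment terms as printed, for example 'Net 30'"),
]

if __name__ == "__main__":
    for field in invoice_schema:
        print(f"{field.name} ({field.type}): {field.description}")
```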

Extract values

Once you have approved the schema, click Start Extraction. Each field is extracted with a dedicated per-field prompt — the model reads the entire document with the specific goal of finding the value for that field. Results typically arrive within 30 to 60 seconds for a standard invoice.

Extracted values appear in the Review & Edit panel alongside the source PDF with bounding boxes highlighting where each value was found on the page. Review each field value against the highlighted region. If a value is wrong — wrong number, truncated text, wrong date format — click on the field and type the correct value.

Save your corrections as ground truth by clicking Save Ground Truth. The saved, locked set of values is what the GEPA learning loop uses to evaluate extraction quality and generate improved prompts.

Improve with the GEPA loop

After saving ground truth, click Improve to run the GEPA learning loop. The loop compares your saved corrections against the extraction output, identifies which fields had errors and what kind of errors they were, and rewrites the per-field prompts to fix those specific failure patterns.

The improved prompts are saved automatically. Click Extract again to re-run extraction with the new prompts; most of the fields you corrected should now be right. Save the new results as ground truth and run Improve again if there are remaining errors. Two to three iterations typically push accuracy above 95 percent on consistent document layouts.

The prompts that GEPA generates are not opaque — you can review them in the Schema panel and edit them manually if you want to fine-tune them further. This transparency is intentional: the learning loop is a productivity tool, not a black box.
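The comparison step at the start of the loop can be pictured as a field-by-field diff between the extraction output and the saved ground truth. The sketch below is only an illustration of that idea under simple assumptions (flat dictionaries, exact string matching); it is not GEPA's implementation, and the function names and sample values are hypothetical.

```python
def diff_fields(extracted: dict, ground_truth: dict) -> dict:
    """Map each mismatched field to (extracted_value, corrected_value)."""
    return {
        name: (extracted.get(name), truth)
        for name, truth in ground_truth.items()
        if extracted.get(name) != truth
    }


def field_accuracy(extracted: dict, ground_truth: dict) -> float:
    """Share of ground-truth fields that the extraction got exactly right."""
    if not ground_truth:
        return 1.0
    errors = diff_fields(extracted, ground_truth)
    return 1 - len(errors) / len(ground_truth)


# Made-up first-pass output and the reviewer's corrections.
extracted = {
    "invoice_number": "INV-1042",
    "invoice_date": "03/04/2024",
    "total": "1,250.00",
    "vendor_name": "Acme Ltd",
}
ground_truth = {
    "invoice_number": "INV-1042",
    "invoice_date": "2024-04-03",
    "total": "1250.00",
    "vendor_name": "Acme Ltd",
}

if __name__ == "__main__":
    for name, (got, want) in diff_fields(extracted, ground_truth).items():
        print(f"{name}: extracted {got!r}, corrected to {want!r}")
    print(f"field accuracy: {field_accuracy(extracted, ground_truth):.0%}")
```

In this example the two mismatches are a date format and a numeric format, which are the kinds of errors a rewritten per-field prompt would target.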

What to expect on first run

The first extraction on a new document type will rarely be perfect. Expect to correct three to five fields on the first pass, typically numeric formatting, date formats, and any fields with unusual placement on the page.

After one round of corrections and one Improve cycle, accuracy typically rises above 90 percent on the same document layout. After two rounds, most teams reach 95 to 99 percent on consistent layouts from the same supplier.

Results vary with document quality. Born-digital PDFs, generated by accounting software or ERP systems, produce the best results. Scanned documents go through OCR, and noise from lower-quality scans reduces accuracy. If you see consistent errors on scanned documents, check that the source scan is at least 200 DPI before investing time in prompt tuning.
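One rough way to check the 200 DPI guideline is to divide a scan's pixel dimensions by the physical page size. The sketch below assumes an A4 page and that you already know the image's pixel dimensions; the function names are illustrative and not part of dataextractor.io.

```python
A4_INCHES = (8.27, 11.69)  # assumption: the scanned page is A4, portrait
MIN_DPI = 200              # guideline from the text above


def effective_dpi(pixel_width: int, pixel_height: int,
                  page_inches: tuple[float, float] = A4_INCHES) -> float:
    """Estimate scan resolution as the lower of horizontal and vertical DPI."""
    width_in, height_in = page_inches
    return min(pixel_width / width_in, pixel_height / height_in)


def scan_meets_guideline(pixel_width: int, pixel_height: int) -> bool:
    """True if the estimated resolution is at least MIN_DPI."""
    return effective_dpi(pixel_width, pixel_height) >= MIN_DPI


if __name__ == "__main__":
    # Typical pixel sizes for an A4 page scanned at roughly 150 and 300 DPI.
    for size in [(1240, 1754), (2480, 3508)]:
        print(size, round(effective_dpi(*size)), scan_meets_guideline(*size))
```

For other paper sizes, pass the page dimensions in inches as `page_inches`.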

Next steps

Once extraction is working well for your primary document type, there are a few natural next steps.

Connect a downstream system. The REST API extracts a document and returns structured JSON — synchronously, or via an async job you poll — so you can write the result straight into your ERP, database, or data lake. See the API Reference and Integrations Guide for details.

Expand to additional document types. Each document type gets its own schema. Start a new extraction session with a different document — a purchase order if you started with invoices, or a different supplier's invoice format — and repeat the schema review and improvement cycle.

Invite team members. Multiple users can work in the same account, reviewing and correcting extractions. Shared corrections feed into the same GEPA loop, so the system improves faster when more reviewers are contributing ground truth.
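For the async option mentioned under "Connect a downstream system", client code typically polls the job until it finishes. The sketch below shows only the polling logic and deliberately avoids real endpoints: `fetch_status` is a callable you would implement against the API Reference, and the `status` values (`"completed"`, `"failed"`) and response shape are assumptions, not the documented API.

```python
import time
from typing import Callable


def wait_for_job(fetch_status: Callable[[], dict],
                 interval_s: float = 2.0,
                 timeout_s: float = 120.0,
                 sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll until the job reports a terminal state, then return its payload."""
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_status()
        status = job.get("status")  # hypothetical field name
        if status == "completed":
            return job
        if status == "failed":
            raise RuntimeError(f"extraction failed: {job.get('error')}")
        if time.monotonic() >= deadline:
            raise TimeoutError("extraction job did not finish in time")
        sleep(interval_s)


if __name__ == "__main__":
    # Simulated responses standing in for real API calls.
    responses = iter([
        {"status": "processing"},
        {"status": "completed", "fields": {"total": "1250.00"}},
    ])
    result = wait_for_job(lambda: next(responses), sleep=lambda s: None)
    print(result["fields"])
```

Passing `sleep` as a parameter keeps the function easy to test without real delays. In production code you would leave the default and have `fetch_status` make the actual HTTP request.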