Schema Configuration for Document Extraction

Define custom extraction schemas for any document type — field types, validation rules, and line item extraction.

Field types

Every schema field has a type that tells the extraction model what kind of value to look for and how to normalise the output.

text extracts any string value. Use it for names, addresses, reference numbers, or any field where the raw text on the document is the correct output.

number extracts an integer or decimal and normalises formatting by removing thousands separators and converting comma decimals to period decimals.

currency extracts a monetary amount with an optional currency code, returned as a float with a separate currency field (e.g., amount: 12847.50, currency: EUR).

date extracts a date value and normalises it to ISO 8601 format (YYYY-MM-DD) regardless of how it appears on the document.

boolean extracts a true/false value based on the presence or text of an indicator (e.g., Taxable: Yes).

enum extracts one of a fixed set of allowed values that you specify in the field definition.

line_item is the special type for tabular data — it defines a sub-schema for the columns of a table. See the Line Item Extraction section for details.
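As a rough illustration of the number and date normalisation described above, the sketch below shows how such rules could behave. It is not the extraction model's implementation: the function names normalise_number and normalise_date, the separator heuristic, and the day-first date formats are all assumptions made for this example.

```python
import re
from datetime import datetime


def normalise_number(raw: str) -> float:
    """Remove thousands separators and convert a comma decimal to a period decimal."""
    # Keep only digits, separators and a sign; drops currency symbols and spaces.
    s = re.sub(r"[^\d,.\-]", "", raw)
    last_comma, last_dot = s.rfind(","), s.rfind(".")
    # Assumption: the rightmost separator is the decimal mark.
    if last_comma > last_dot:
        thousands, decimal = ".", ","
    else:
        thousands, decimal = ",", "."
    s = s.replace(thousands, "")
    # A "decimal" mark that repeats (1.234.567) was really a thousands separator.
    if s.count(decimal) > 1:
        s = s.replace(decimal, "")
    return float(s.replace(decimal, "."))


def normalise_date(raw: str, formats=("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")) -> str:
    """Return the date as ISO 8601 (YYYY-MM-DD). Assumes day-first input formats."""
    for fmt in formats:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {raw!r}")


amount = normalise_number("12.847,50")
issued = normalise_date("14/03/2024")
```

A lone separator is ambiguous (1,234 could mean one thousand two hundred and thirty-four or 1.234), so this heuristic simply treats a single comma as a decimal mark. A real pipeline would need more context than the string alone.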

Validation rules

Validation rules run after extraction and flag values that do not meet expected constraints. Flagged values appear as warnings on the Review & Edit screen — they do not block extraction from completing, but they signal fields that need human attention before the data is trusted downstream.

Available validation types:

regex requires the extracted value to match a regular expression, which is useful for structured formats like invoice numbers or IBAN codes.

min and max set numeric bounds for number and currency fields. Use these to catch implausible totals such as a negative invoice amount.

required marks a field as mandatory and flags it as an error rather than a warning when the extracted value is empty.

custom accepts a JavaScript expression evaluated against the field value, returning true if valid and false if not, which enables cross-field validation (for example, total must equal subtotal plus tax).
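The rule types above can be sketched as a small checker. This is a minimal illustration under stated assumptions, not the product's validation engine: the validate_field function, the rule dictionary keys, and the message strings are hypothetical, the regex is assumed to match the whole value, and a Python callable stands in for the JavaScript expression that custom rules actually use.

```python
import re
from typing import Any, Dict, List, Optional, Tuple

Finding = Tuple[str, str]  # (severity, message)


def validate_field(
    value: Any,
    rules: Dict[str, Any],
    record: Optional[Dict[str, Any]] = None,
) -> List[Finding]:
    """Return findings for one field; an empty list means the value passed."""
    findings: List[Finding] = []
    if value is None or value == "":
        # Only `required` escalates to an error; other rules are skipped on empty values.
        if rules.get("required"):
            findings.append(("error", "required field is empty"))
        return findings
    if "regex" in rules and not re.fullmatch(rules["regex"], str(value)):
        findings.append(("warning", "value does not match pattern"))
    if "min" in rules and value < rules["min"]:
        findings.append(("warning", "value below minimum"))
    if "max" in rules and value > rules["max"]:
        findings.append(("warning", "value above maximum"))
    custom = rules.get("custom")
    if custom is not None and not custom(value, record or {}):
        findings.append(("warning", "custom rule failed"))
    return findings


# Cross-field check: total must equal subtotal plus tax (with a rounding tolerance).
record = {"subtotal": 100.0, "tax": 21.0, "total": 125.0}
total_rules = {
    "required": True,
    "min": 0,
    "custom": lambda v, r: abs(v - (r["subtotal"] + r["tax"])) < 0.005,
}
findings = validate_field(record["total"], total_rules, record)
```

Here the total of 125.0 does not equal 100.0 plus 21.0, so the custom rule produces a warning while the required and min rules pass.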

Validation rules are defined per field in the schema editor. You can also add them through the conversational copilot: tell it what constraint you need in plain English and it will configure the rule.

Line item extraction

For documents with tabular data — invoice line items, PO lines, receipt rows — define a field with type line_item and a sub-schema listing the columns you want to extract from the table.

A typical invoice line item sub-schema has fields: description (text), quantity (number), unit_price (currency), and line_total (currency). The extraction model identifies the table on the document, parses the rows, and returns one array entry per row with the sub-fields populated.
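To make the shape of this concrete, here is a sketch of what such a field definition and its result might look like. The key names (name, type, sub_schema) and the sample rows are invented for illustration and are not taken from the product's actual schema format or from a real document. The currency values follow the amount-plus-currency shape described under Field types.

```python
# Hypothetical field definition: a line_item field with a four-column sub-schema.
line_items_field = {
    "name": "line_items",
    "type": "line_item",
    "sub_schema": [
        {"name": "description", "type": "text"},
        {"name": "quantity", "type": "number"},
        {"name": "unit_price", "type": "currency"},
        {"name": "line_total", "type": "currency"},
    ],
}

# Made-up extraction result for a two-row table: one array entry per row.
extracted = {
    "line_items": [
        {
            "description": "Widget A",
            "quantity": 10,
            "unit_price": {"amount": 4.5, "currency": "EUR"},
            "line_total": {"amount": 45.0, "currency": "EUR"},
        },
        {
            "description": "Widget B",
            "quantity": 2,
            "unit_price": {"amount": 12.0, "currency": "EUR"},
            "line_total": {"amount": 24.0, "currency": "EUR"},
        },
    ]
}

# Every row should carry exactly the sub-fields named in the sub-schema.
expected_keys = {col["name"] for col in line_items_field["sub_schema"]}
rows_match_schema = all(set(row) == expected_keys for row in extracted["line_items"])
```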

Line item extraction works best on well-structured tables with consistent column headers. It handles tables that span multiple pages. If column headers are absent — some invoices use positional layout without headers — add the column positions in the field descriptions to guide the model toward the right region of the page.

Writing effective field descriptions

The field description is the most important part of your schema. It is used directly as the instruction to the extraction model — the better the description, the more accurately the model locates the right value.

Effective descriptions are specific about where the value appears and how it is typically labelled: "The total amount due including all taxes and fees, usually found at the bottom right of the invoice, labelled Total, Amount Due, or Bedrag." Vague descriptions like "Total amount" leave the model to guess which of several similar-looking numbers on the page is correct.

When extraction is wrong, the fix is almost always in the description. If the model is extracting the subtotal instead of the total, update the description to explicitly say "the final total after taxes, not the subtotal or net amount." If it is picking up the wrong date, specify which one: "the invoice issue date, not the payment due date or the service period start date."

The conversational copilot can rewrite descriptions for you. Tell it what went wrong — "the quantity field is extracting the unit price instead" — and it will suggest a revised description that disambiguates the two fields.

Line item sub-schemas in practice

The most common challenge with line item extraction is inconsistent table structure across suppliers. One supplier's invoice has Description, Qty, Unit Price, Total columns in that order; another has Item Code, Description, Quantity, VAT Rate, Net, Gross. Define separate schemas for each document type rather than trying to build a single universal schema that covers all variations.

For each column in the table, write a description that identifies the column by its typical header variations. For quantity: "The number of units ordered or delivered. The column header may be Qty, Quantity, Units, or Count."

If a table column is sometimes absent — for example, some invoices include a discount column and others do not — mark the field as optional in the schema. The extraction returns null for optional fields that are not found, rather than failing the extraction or forcing a blank value.
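Downstream code then has to treat null as "column not present" rather than as zero. The sketch below shows one way to do that in Python, where null arrives as None. The row layout and the helper names are assumptions for this example only.

```python
from typing import Any, Dict, List

# Made-up rows: the first invoice had no discount column, the third had a real 0.0.
rows: List[Dict[str, Any]] = [
    {"description": "Widget A", "quantity": 10, "discount": None},
    {"description": "Widget B", "quantity": 2, "discount": 1.5},
    {"description": "Widget C", "quantity": 1, "discount": 0.0},
]


def rows_with_discount(items: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep rows where the optional discount column was actually found."""
    # Compare with None explicitly: a truthiness test would also drop a genuine 0.0.
    return [r for r in items if r["discount"] is not None]


def total_discount(items: List[Dict[str, Any]]) -> float:
    """Sum discounts, skipping rows where the column was absent."""
    return sum(r["discount"] for r in rows_with_discount(items))


found = rows_with_discount(rows)
total = total_discount(rows)
```

Checking `is not None` keeps a genuine discount of 0.0 distinct from a column that was never on the page.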

Schema versioning

When you run the GEPA learning loop, the per-field prompts in your schema are rewritten to improve accuracy. Each improvement iteration produces a new schema version. You can review the prompt history for any field to see what changed between versions and understand why the loop made each edit.

If an updated prompt performs worse on a new document layout — a regression — you can roll back individual field prompts to a previous version from the Schema panel. This is useful when a prompt optimised for one supplier's layout starts performing poorly on a different supplier's variation of the same document type.
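Conceptually, per-field rollback means each field keeps its own prompt history and its own pointer to the active version. The sketch below models that idea only. The FieldPromptHistory class and its methods are hypothetical, and the prompts are example strings. It says nothing about how the Schema panel stores versions internally.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FieldPromptHistory:
    """Prompt versions for a single schema field, oldest first."""
    versions: List[str] = field(default_factory=list)
    active: int = -1  # index of the version currently in use

    def add(self, prompt: str) -> None:
        # A new iteration appends a version and makes it active.
        self.versions.append(prompt)
        self.active = len(self.versions) - 1

    def rollback(self, version_index: int) -> None:
        # Point back at an earlier version without deleting later ones.
        if not 0 <= version_index < len(self.versions):
            raise IndexError(f"no such version: {version_index}")
        self.active = version_index

    @property
    def current(self) -> str:
        return self.versions[self.active]


schema: Dict[str, FieldPromptHistory] = {
    "total": FieldPromptHistory(),
    "invoice_date": FieldPromptHistory(),
}
schema["total"].add("The total amount due including all taxes and fees.")
schema["total"].add("The final total after taxes, not the subtotal or net amount.")
schema["invoice_date"].add("The invoice issue date, not the payment due date.")

# Roll back only the total field; invoice_date keeps its active prompt.
schema["total"].rollback(0)
```

Keeping later versions in the list after a rollback means the history stays available for review.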

For production workflows where multiple document types share similar fields (for example, vendor name appears on both invoices and purchase orders), create a separate schema for each document type rather than sharing one schema. Per-type schemas allow the prompts to be fine-tuned to the specific layout conventions of each document type without causing regressions in the other.