Document Extraction API Reference
REST API reference for the dataextractor.io document extraction API, covering authentication, datasets, extraction, and webhooks.
Last updated: April 13, 2026
Authentication
All API requests require a Bearer token in the Authorization header. Generate an API key from the Developer page in your account settings. Keys are scoped to your account, can be given descriptive names for tracking (e.g., production-erp-sync or staging-tests), and can be revoked at any time without affecting other keys.
Include the token in every request using the header: Authorization: Bearer <your-api-key>.
Keys are long-lived by default. Never expose your API key in client-side code or commit it to a public repository. If a key is compromised, revoke it immediately from the Developer page and generate a replacement. For automated workflows that require short-lived tokens or fine-grained scopes, contact us about the OAuth client credentials flow, available on Enterprise plans.
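As a minimal sketch of an authenticated request using only the Python standard library (the base URL https://api.dataextractor.io is an assumption; this reference only states the /api/v1/... paths):

```python
import os
import urllib.request

API_BASE = "https://api.dataextractor.io"  # assumed base URL

def auth_headers(api_key: str) -> dict:
    # Every request carries the key as a Bearer token in the Authorization header.
    return {"Authorization": f"Bearer {api_key}"}

def api_get(path: str, api_key: str) -> bytes:
    req = urllib.request.Request(API_BASE + path, headers=auth_headers(api_key))
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Read the key from the environment rather than hard-coding it:
# api_get("/api/v1/datasets", os.environ["DATAEXTRACTOR_API_KEY"])
```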
List datasets
GET /api/v1/datasets — list all datasets in your account. The response is paginated; see the Pagination section for cursor parameters.
Supported query parameters: customer_id (string) to filter by customer, dataset_type (string) to filter by type such as invoice or purchase_order, is_verified (boolean) to return only verified extractions, search (string) for full-text search across extracted field values, limit (integer, default 20, max 100) to control page size, and cursor (string) for the pagination cursor from the previous response.
The response includes a data array of dataset objects and a next_cursor field. When next_cursor is null, you have reached the last page.
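A sketch of assembling the query string for this endpoint. The parameter names are the ones listed above; lowercasing booleans to true/false is an assumption about how the API expects them encoded:

```python
import urllib.parse

def list_datasets_url(base: str, *, customer_id=None, dataset_type=None,
                      is_verified=None, search=None, limit=20, cursor=None) -> str:
    # Only send parameters the caller actually set.
    params = {"limit": str(limit)}
    if customer_id:
        params["customer_id"] = customer_id
    if dataset_type:
        params["dataset_type"] = dataset_type
    if is_verified is not None:
        params["is_verified"] = "true" if is_verified else "false"
    if search:
        params["search"] = search
    if cursor:
        params["cursor"] = cursor
    # urlencode handles percent-escaping of search terms and cursor tokens.
    return f"{base}/api/v1/datasets?" + urllib.parse.urlencode(params)
```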
Get a dataset
GET /api/v1/datasets/{id} — return a single dataset with all extracted fields, line items, ground truth values, and accuracy scores.
The fields array contains each extracted field with name, value, type, confidence score, and bounding_box coordinates. The ground_truth array contains the human-verified correct values saved after review. The accuracy object exposes per-field match scores and an overall_accuracy float from 0.0 to 1.0.
This endpoint is the standard way to retrieve extraction results when you are not using webhooks. Trigger an extraction, store the returned dataset ID, then poll this endpoint every few seconds until the status field transitions from processing to complete.
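The poll loop described above can be sketched as follows. The fetcher is injected as a callable (anything that maps a dataset ID to the decoded JSON dict), which keeps the loop testable without a live API; the status values processing and complete come from this reference:

```python
import time

def poll_dataset(fetch, dataset_id: str, interval: float = 3.0, timeout: float = 120.0):
    # Poll until the status field leaves "processing", or give up after `timeout`.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        dataset = fetch(dataset_id)
        if dataset.get("status") != "processing":
            return dataset
        time.sleep(interval)
    raise TimeoutError(f"dataset {dataset_id} still processing after {timeout}s")
```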
Webhooks
Subscribe to real-time events by configuring a webhook URL from the Integrations page in your account. Supported event types are extraction.completed, matching.completed, and extraction.failed.
Webhook payloads are signed with HMAC-SHA256. Every request includes an X-Dataextractor-Signature header containing the hex-encoded HMAC of the raw request body, computed using your webhook secret. Always verify this signature before processing the payload — reject any request where the signature does not match.
Webhook delivery is retried up to five times with exponential backoff if your endpoint returns a non-2xx status or times out after 10 seconds. The payload includes an event_id field — use it as an idempotency key to handle retries safely without processing the same event twice.
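A sketch of both safeguards: constant-time signature verification against the X-Dataextractor-Signature header, and event_id-based deduplication. The in-memory set is a stand-in; production handlers would use durable storage:

```python
import hashlib
import hmac

def verify_webhook(secret: bytes, raw_body: bytes, signature_header: str) -> bool:
    # Recompute the hex-encoded HMAC-SHA256 of the raw (undecoded) body and
    # compare in constant time; never compare signatures with ==.
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header)

_seen_events = set()  # replace with durable storage in production

def handle_once(event_id: str, handler) -> bool:
    # event_id is the idempotency key: retried deliveries are skipped.
    if event_id in _seen_events:
        return False
    _seen_events.add(event_id)
    handler()
    return True
```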
Rate limits & quotas
API rate limits are enforced per API key. The default limit is 60 requests per minute for list and get endpoints, and 10 requests per minute for extraction endpoints (upload and extract operations). Exceeding these limits returns a 429 Too Many Requests response with a Retry-After header indicating how many seconds to wait before retrying.
Extraction quotas are tied to your plan. The Free plan allows 10 extractions per month. The Pro plan allows 500. Enterprise accounts have volume-based quotas negotiated separately.
If you are building a batch processing pipeline, spread requests evenly over time rather than sending them in bursts. The API supports concurrent requests up to the rate limit, but sustained burst traffic above the limit will result in throttling that slows your throughput overall.
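One way to spread requests evenly, sketched as a pacing generator that never yields faster than a target rate (per_minute=10 matching the extraction-endpoint limit above; upload_and_extract is a hypothetical caller-side function):

```python
import time

def paced(items, per_minute: int):
    # Yield items no faster than `per_minute`, spreading calls evenly
    # over time instead of bursting up to the rate limit.
    interval = 60.0 / per_minute
    for item in items:
        started = time.monotonic()
        yield item
        elapsed = time.monotonic() - started
        if elapsed < interval:
            time.sleep(interval - elapsed)

# Example: for doc in paced(documents, per_minute=10): upload_and_extract(doc)
```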
Error codes reference
The API uses standard HTTP status codes. The most common responses:
400 Bad Request — the request body is malformed or missing required fields. The response body includes a detail field with a specific description of the validation failure.
401 Unauthorized — the Authorization header is missing, or the API key is invalid or revoked.
404 Not Found — the requested dataset ID does not exist in your account.
422 Unprocessable Entity — the uploaded file is not a supported format, is corrupted, or exceeds the 25MB size limit.
429 Too Many Requests — rate limit exceeded. Check the Retry-After header and back off accordingly.
500 Internal Server Error — an unexpected server-side error. Retry with exponential backoff. If the error persists, contact support with the request ID from the X-Request-Id response header.
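The two retryable cases above (429 and 500) suggest a small delay helper, sketched here: honor Retry-After when the server sends it, otherwise fall back to capped exponential backoff. The base and cap values are illustrative defaults, not documented API behavior:

```python
def backoff_delay(attempt: int, retry_after=None, base: float = 1.0, cap: float = 30.0) -> float:
    # For 429, honor the server's Retry-After value (seconds); for 500,
    # use capped exponential backoff. `attempt` is 0-based.
    if retry_after is not None:
        return float(retry_after)
    return min(cap, base * (2 ** attempt))
```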
Pagination
List endpoints use cursor-based pagination. The initial request returns the first page of results and a next_cursor value in the response. Pass that value as the cursor query parameter on the next request to retrieve the following page. When next_cursor is null in the response, you have fetched all results.
Cursor tokens are opaque strings — do not attempt to parse or construct them manually. They are valid for 24 hours after the initial request. If a cursor expires, restart pagination from the beginning without a cursor parameter.
For large result sets, use the search and filter parameters to narrow the scope before paginating. The limit parameter controls page size up to a maximum of 100 results per page, which reduces the number of round trips for large datasets.
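The cursor loop described above can be sketched as a generator. The page fetcher is injected (any callable mapping a cursor, or None for the first page, to the decoded JSON response), so the walk logic is testable offline; data and next_cursor are the response fields from this reference:

```python
def iter_all(fetch_page):
    # fetch_page(cursor) -> dict with "data" (list) and "next_cursor" keys.
    cursor = None
    while True:
        page = fetch_page(cursor)
        yield from page["data"]
        cursor = page.get("next_cursor")
        if cursor is None:
            # next_cursor is null on the last page: stop.
            return
```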