
Learn how to use Extractors

Learn about the different extraction tools and services available when configuring a field or fieldset.

Extractors are the main building block of the extraction pipeline and do most of the heavy lifting. They let the user define how a certain type of information is extracted: either through machine learning (which requires examples to train a model on) or through rule-based approaches (which need no examples and work right away).
 

Rule Based Extractors for fields

Name of extractor
Description
regex
Searches for a string that matches the defined regular expression.
If more than one string matches the regex, the system takes the first one.
barcodes
Filters the input barcodes by regex pattern, barcode type, and page number.
key-textblock
 Extraction of a continuous text block relative to a key token.
key-region
Extracts information from a specified region relative to a key token. The vector runs from the top-left corner of the key to the top-left corner of the region.
fixed_region
Extraction of a value from a fixed region (depending only on the coordinates). In most cases a verifier or a transformer is needed.
key-value-pair
Definition of a keyword and a direction (left, right, down, up) in which the value is located relative to the keyword.
currency
Finds the most common currency in the document.
extract-nothing
A placeholder extractor, needed because every field must have an extractor. Nothing is read from the document; the field exists exclusively for manual validation.
checkbox
Extraction of the checkboxes, currently not for public use. 
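
As an illustration of two of the rule-based behaviours described above, the following is a minimal Python sketch, not the product's implementation. It assumes that "the first one" for regex means the leftmost match in the text, and that currency works by counting currency codes. The function names and the list of currency codes are hypothetical.

```python
import re
from collections import Counter
from typing import Optional


def first_regex_match(text: str, pattern: str) -> Optional[str]:
    """Return the first (leftmost) substring matching `pattern`, or None."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


# Hypothetical list of currency codes, for illustration only.
CURRENCY_PATTERN = re.compile(r"\b(EUR|USD|GBP|CHF)\b")


def most_common_currency(text: str) -> Optional[str]:
    """Return the currency code that occurs most often in `text`, or None."""
    codes = CURRENCY_PATTERN.findall(text)
    if not codes:
        return None
    return Counter(codes).most_common(1)[0][0]


sample = (
    "Invoice INV-1001 replaces INV-0999. "
    "Net 100.00 EUR, tax 19.00 EUR, approx. 130 USD."
)
print(first_regex_match(sample, r"INV-\d+"))  # INV-1001
print(most_common_currency(sample))           # EUR
```

Both invoice numbers match the pattern, but only the first is returned, which is why a more specific pattern is usually preferable to relying on match order.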

Machine Learning (ML) Based Extractors for field sets

Name of extractor
Description
gnn-table-extractor
Best suited for: structured tables with a clear grid layout.
  • Works best with simple, well-aligned tables.
  • Assumes homogeneous columns (each column contains the same type of data).
  • Strong performance on documents with consistent row/column structure.
gnn-address-extractor
Best suited for: Address blocks.
  • Based on the gnn-fieldset-extractor.
  • Pre-trained on various address formats.
  • Handles different address structures and formats.
gnn-line-item-extractor
Best suited for: Line-item layouts with repeating patterns.
  • Designed for repetitive line-item structures.
  • Does not require homogeneous columns.
  • Pre-trained on purchase-to-pay (P2P) style line items.
gnn-repeatable-fieldset-extractor
Best suited for: Repeated groups of fields without a strict structure.
  • No strict layout or pattern required.
  • Fields within a group are expected to be closer to each other than to other groups.
  • Fieldsets can appear side-by-side or on the same line.
gnn-fieldset-extractor
Best suited for: Logical groups of related fields.
  • Extracts fields that belong together conceptually.
  • Focuses on grouping rather than repetition.
gnn-forms-extractors
Best suited for: Form-based documents.
  • Similar to the gnn-fieldset-extractor.
  • Optimized for structured form layouts.
  • Only applied to the first page of documents.
gnn-field-extractor-allowed-values
Best suited for: Fields with predefined or controlled values.
  • Users define a custom list of allowed labels during configuration.
  • The model learns to assign those labels to the correct tokens.

Machine Learning (ML) Based Extractors for fields

Name of extractor
Description
gnn-field-extractor
Best suited for: Extraction of single values or regions, with or without keywords.
  • Designed to extract individual fields (not groups or tables).
  • Can work with keywords (e.g., labels like “Total”, “Date”) or without keywords.
  • Supports extraction based on:
      • Keyword proximity (e.g., value next to “Invoice Total”)
      • Positional/region-based detection (fixed or learned locations)

 

Extractor "fixed-region"

Defines the region from which the result is extracted. All text inside the region is returned as the result, so in most cases a transformer is necessary to extract the correct data. The values top, left, right, and bottom define the limits of the rectangle.
 
Parameter
Description
Pattern
Search pattern (regex)
Top
top coordinate of the rectangle in percent (0 <= percent <= 100)
Left
left coordinate of the rectangle in percent (0 <= percent <= 100)
Bottom
bottom coordinate of the rectangle in percent (0 <= percent <= 100)
Right
right coordinate of the rectangle in percent (0 <= percent <= 100)
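
To make the percent coordinates concrete, here is a minimal Python sketch of how text could be collected from a fixed rectangle. It is an illustration under assumptions, not the product's implementation: the Token class, the use of token centre points, and the convention that top is measured from the top edge of the page are all hypothetical. The Pattern parameter is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    """Hypothetical OCR token with its centre in fractional page coordinates."""
    text: str
    x: float  # horizontal centre, 0 (left edge) to 1 (right edge)
    y: float  # vertical centre, 0 (top edge) to 1 (bottom edge)


def extract_fixed_region(tokens: List[Token], top: float, left: float,
                         bottom: float, right: float) -> str:
    """Join the text of all tokens whose centre lies inside the rectangle.

    top, left, bottom and right are percentages between 0 and 100.
    """
    if not (0 <= top <= bottom <= 100 and 0 <= left <= right <= 100):
        raise ValueError("expected 0 <= top <= bottom <= 100 and 0 <= left <= right <= 100")
    # Convert percentages to the fractional coordinates used by the tokens.
    t, l, b, r = (value / 100.0 for value in (top, left, bottom, right))
    inside = [tok for tok in tokens if l <= tok.x <= r and t <= tok.y <= b]
    # Approximate reading order: top to bottom, then left to right.
    inside.sort(key=lambda tok: (tok.y, tok.x))
    return " ".join(tok.text for tok in inside)


page = [
    Token("Invoice", 0.10, 0.05),
    Token("No.", 0.18, 0.05),
    Token("4711", 0.80, 0.05),
    Token("Total", 0.10, 0.90),
]
print(extract_fixed_region(page, top=0, left=70, bottom=10, right=100))  # 4711
```

Widening the rectangle to left=0 would return "Invoice No. 4711", which shows why a transformer or verifier is usually needed to isolate the actual value.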

Extractor "key-region"

Defines a keyword, a distance vector (horizontal and vertical distance), and a region size (textbox width and height) that together locate the value relative to the keyword.
The keyword (pattern) can be defined as a regular expression. If multiple patterns are configured and two or more of them match, the topmost pattern has the highest priority. Only one vector and one region size can be provided.
 
Parameter
Description
Pattern
Search pattern (regex)
Fuzzy
[default=disabled] Enables fuzzy regex with a max edit distance of 2
X
Top-left to top-left horizontal distance in fractional (0-1) page coordinates
Y
Top-left to top-left vertical distance in fractional (0-1) page coordinates
WIDTH
Width of the textbox in fractional (0-1) page coordinates
HEIGHT
Height of the textbox in fractional (0-1) page coordinates
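
The interplay of pattern priority, vector, and region size can be sketched in Python as follows. This is an illustration under assumptions, not the product's implementation: the Token and Region classes and all function names are hypothetical, tokens are located by their top-left corner, and fuzzy matching is left out.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """Hypothetical OCR token, positioned by its top-left corner (0-1)."""
    text: str
    left: float
    top: float


@dataclass
class Region:
    left: float
    top: float
    width: float
    height: float


def find_key(tokens: List[Token], patterns: List[str]) -> Optional[Token]:
    """Return the key token; earlier (topmost) patterns take priority."""
    for pattern in patterns:
        for tok in tokens:
            if re.fullmatch(pattern, tok.text):
                return tok
    return None


def key_region(key: Token, x: float, y: float,
               width: float, height: float) -> Region:
    """Shift the key's top-left corner by the vector (x, y) to get the region."""
    return Region(left=key.left + x, top=key.top + y, width=width, height=height)


def tokens_in_region(tokens: List[Token], region: Region) -> List[str]:
    """Return the text of tokens whose top-left corner lies inside the region."""
    return [
        tok.text for tok in tokens
        if region.left <= tok.left <= region.left + region.width
        and region.top <= tok.top <= region.top + region.height
    ]


page = [
    Token("Date:", 0.10, 0.20), Token("2024-01-31", 0.45, 0.20),
    Token("Total:", 0.10, 0.50), Token("119.00", 0.45, 0.50),
]
# Both patterns match a token; the first pattern in the list wins.
key = find_key(page, [r"Total:?", r"Date:?"])
region = key_region(key, x=0.30, y=0.0, width=0.20, height=0.03)
print(tokens_in_region(page, region))  # ['119.00']
```

Because only one vector and one region size can be configured, all patterns share the same offset, so they should describe keywords that sit in the same position relative to the value.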
 

Extractor "checkbox"

To extract checkboxes, a keyword that identifies the group of checkboxes is needed first. The vectors to the individual checkboxes are stored relative to this keyword.
 
At the moment the checkbox extractor cannot be configured conveniently via the GUI; it can only be configured through input parameters (in JSON format):
{
  "train_data": {
    "scale": false,
    "vectors": [
      [0.1755, 0.0061], [0.1755, 0.0418], [0.1755, 0.0775], [0.1755, 0.1132]
    ],
    "key_text": "ProductA",
    "key_descr": "ProductADesc",
    "box_descrs": [
      "ProductA", "ProductB", "ProductC", "ProductD"
    ],
    "key_center": [
      0.1539, 0.1748, 0
    ],
    "box_regions": [
      { "top": 0.1665, "left": 0.298, "right": 0.3214, "bottom": 0.1822 },
      { "top": 0.2125, "left": 0.298, "right": 0.3214, "bottom": 0.2282 },
      { "top": 0.2585, "left": 0.298, "right": 0.3214, "bottom": 0.2742 },
      { "top": 0.3045, "left": 0.298, "right": 0.3214, "bottom": 0.3202 }
    ],
    "vector_mode": "bottom-left"
  }
}
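
Since these parameters are written by hand, a quick structural check before uploading them can catch typos. The following Python sketch is only an illustration: it assumes that vectors, box_descrs, and box_regions are parallel lists with one entry per checkbox (inferred from the example above, which has four of each) and that regions use fractional 0-1 coordinates. The function name is hypothetical and the check is not part of the product.

```python
import json

PARAMS_JSON = """
{
  "train_data": {
    "scale": false,
    "vectors": [[0.1755, 0.0061], [0.1755, 0.0418], [0.1755, 0.0775], [0.1755, 0.1132]],
    "key_text": "ProductA",
    "key_descr": "ProductADesc",
    "box_descrs": ["ProductA", "ProductB", "ProductC", "ProductD"],
    "key_center": [0.1539, 0.1748, 0],
    "box_regions": [
      {"top": 0.1665, "left": 0.298, "right": 0.3214, "bottom": 0.1822},
      {"top": 0.2125, "left": 0.298, "right": 0.3214, "bottom": 0.2282},
      {"top": 0.2585, "left": 0.298, "right": 0.3214, "bottom": 0.2742},
      {"top": 0.3045, "left": 0.298, "right": 0.3214, "bottom": 0.3202}
    ],
    "vector_mode": "bottom-left"
  }
}
"""


def validate_checkbox_params(params: dict) -> list:
    """Return a list of human-readable problems (empty if none were found)."""
    train = params["train_data"]
    problems = []
    expected = len(train["box_descrs"])
    # Assumption: one vector and one region per checkbox description.
    for name in ("vectors", "box_regions"):
        if len(train[name]) != expected:
            problems.append(f"{name} has {len(train[name])} entries, expected {expected}")
    for index, box in enumerate(train["box_regions"]):
        # Assumption: fractional coordinates with the origin at the top-left.
        if not (0 <= box["left"] < box["right"] <= 1
                and 0 <= box["top"] < box["bottom"] <= 1):
            problems.append(f"box_regions[{index}] is not a valid rectangle")
    return problems


params = json.loads(PARAMS_JSON)
print(validate_checkbox_params(params))  # []
```

An empty list means the structure is internally consistent; it does not confirm that the coordinates match the actual document.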