Learn how to use Extractors
Learn about different extraction tools and services available to be used while configuring a field/fieldset.
Extractors are the main block of the extraction pipeline and so, does the usual heavy-lifting. These let the user define how to extract a certain type of information, i. e. through machine learning (requires examples to train a model on) or through rule based approaches (do not need any examples and work right away).
Rule Based Extractors for fields
|
Name of extractor
|
Description
|
|
regex
|
Search for a string, where the defined regular expression fits.
If more than one strings fit, to the regex, the system takes the first one.
|
|
barcodes
|
This extractor can be used to filter input barcodes with regex patterns, types and page number.
|
|
key-textblock
|
Extraction of a continuous text block relative to a key token.
|
|
key-region
|
Extract information from a specified region relative to a key token. Vector is top left corner of key to top left corner of region.
|
|
fixed_region
|
Extraction of a value at a fixed region (only depending on the coordinates). In most of the cases a verifier or a transformer is needed.
|
|
key-value-pair
|
Definition of a keyword and a direction (left, right, down, up) where the value is placed relatively to the keyword.
|
|
currency
|
Find the most common currency on the document.
|
|
extract-nothing
|
It is a placeholder for an extractor, because a field needs an extractor. Nothing is read in the process. The field exists exclusively for pure manual validation.
|
|
checkbox
|
Extraction of the checkboxes, currently not for public use.
|
Machine Learning (ML) Based Extractors for field sets
|
Name of extractor
|
Description
|
|
gnn-table-extractor
|
Best suited for: structured tables with a clear grid layout.
|
|
gnn-address-extractor
|
Best suited for: Address blocks.
|
|
gnn-line-item-extractor
|
Best suited for: Line-item layouts with repeating patterns.
|
|
gnn-repeatable-fieldset-extractor
|
Best suited for: Repeated groups of fields without a strict structure.
|
|
gnn-fieldset-extractor
|
Best suited for: Logical groups of related fields.
|
|
gnn-forms-extractors
|
Best suited for: Form-based documents.
|
| gnn-field-extractor-allowed-values |
Best suited for: Fields with predefined or controlled values.
Users define a custom list of allowed labels during configuration.
The model learns to assign those labels to the correct tokens. |
Machine Learning (ML) Based Extractors for fields
|
Name of extractor
|
Description
|
|
gnn-field-extractor
|
Best suited for: Extraction of single values or regions, with or without keywords.
|
Extractor "fixed-region"
Definition of the region, where the result should be extracted. The whole text in the region is the result, so that in most cases a transformer is necessary to extract the correct data. The values top, left, right, bottom define the limits of the rectangle.
|
Parameter
|
Description
|
|
Pattern
|
Search pattern (regex)
|
|
Top
|
top coordinate of the rectangle in percent (0 <= percent <= 100)
|
|
Left
|
left coordinate of the rectangle in percent (0 <= percent <= 100)
|
|
Bottom
|
bottom coordinate of the rectangle in percent (0 <= percent <= 100)
|
|
Right
|
right coordinate of the rectangle in percent (0 <= percent <= 100)
|
Extractor "key-region"
Definition of a keyword, distance Vector (horizontal and vertical distance) and a Region Size (textbox width and height where the value is placed relatively to the keyword.
As keyword (pattern) a regular expression can be defined. If multiple patterns are configured, the most top expression has the highest priority, in case that two or more patterns match. Only one Vector and Region Size can be provided.
|
Parameter
|
Description
|
|
Pattern
|
Search pattern (regex)
|
|
Fuzzy
|
[default=disabled] Enables fuzzy regex with a max edit distance of 2
|
|
X
|
Top-left to top-left horizontal distance in fractional (0-1) page coordinates
|
|
Y
|
Top-left to top-left vertical distance in fractional (0-1) page coordinates
|
|
WIDTH
|
Width of the textbox in fractional (0-1) page coordinates
|
|
HEIGHT
|
Width of the textbox in fractional (0-1) page coordinates
|
Extractor "checkbox"
To extract checkboxes, first of all a keyword to extract a group of checkboxes is needed. According to this keyword the vectors to the checkboxes are stored.

At the moment the checkbox-extractor can't be configured comfortable via GUI, only using input parameters (in JSON format).
{
"train_data": {
"scale": false,
"vectors": [
[0.1755, 0.0061], [0.1755,0.0418], [0.1755, 0.0775], [0.1755, 0.1132]
],
"key_text": "ProductA",
"key_descr": "ProductADesc",
"box_descrs": [
"ProductA", "ProductB", "ProductC", "ProductD"
],
"key_center": [
0.1539, 0.1748, 0
],
"box_regions": [
{ "top": 0.1665, "left": 0.298, "right": 0.3214, "bottom": 0.1822},
{ "top": 0.2125, "left": 0.298, "right": 0.3214, "bottom": 0.2282},
{ "top": 0.2585, "left": 0.298, "right": 0.3214, "bottom": 0.2742},
{ "top": 0.3045, "left": 0.298, "right": 0.3214, "bottom": 0.3202}
],
"vector_mode": "bottom-left"
}
}
