
Learn how to use Extractors

Learn about the extraction tools and services available when configuring a field or field set, and how to set them up.

Extractors are the main building block of the extraction pipeline and do most of the heavy lifting. They let the user define how a certain type of information is extracted: either through machine learning (which requires examples to train a model on) or through rule-based approaches (which need no examples and work right away).
 

Rule Based Extractors for fields

| Name of extractor | Description |
| --- | --- |
| regex | Searches for a string that matches the defined regular expression. If more than one string matches the regex, the system takes the first one. |
| barcodes | Filters input barcodes by regex pattern, type, and page number. |
| fixed_region | Extracts a value from a fixed region (depends only on the coordinates). In most cases a verifier or a transformer is needed. |
| key-value-pair | Defines a keyword and a direction (left, right, down, up) in which the value is located relative to the keyword. |
| currency | Finds the most common currency in the document. |
| extract-nothing | A placeholder, because every field needs an extractor. Nothing is read in the process. The field exists exclusively for manual validation. |
| key-textblock | Extracts a continuous text block relative to a key token. |
| key-region | Extracts information from a specified region relative to a key token. The vector runs from the top-left corner of the key to the top-left corner of the region. |
 

Machine Learning (ML) Based Extractors for field sets

| Name of extractor | Description |
| --- | --- |
| gnn-address-extractor | Extraction of an address block. |
| gnn-line-item-extractor | Extraction of an n-tuple of repeatable values, such as line items, banking details, or payment conditions. |
| gnn-fieldset-extractor | Extraction of non-repeatable field sets. |

Machine Learning (ML) Based Extractors for fields

| Name of extractor | Description |
| --- | --- |
| gnn-field-extractor | Extraction of single values and regions, with or without a keyword. |

Extractor "extract-nothing"

It is a placeholder for an extractor, because a field needs an extractor. Nothing is read in the process. The field exists exclusively for pure manual validation.
 

Extractor "barcodes"

This extractor can be used to filter input barcodes with regex patterns, types and page number. 
 

Extractor "regex"

The regex extractor is used to extract a value that has a clear structure. It searches for the matching fragment, and the whole content of that fragment is used as the extraction result. More than one regex pattern can be defined. If multiple results are found, the topmost regex pattern has the highest priority.
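
The priority rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the product's implementation: the function name `extract_with_regex`, the representation of a document as a list of text fragments, and the use of `re.search` are all assumptions made for the example.

```python
import re
from typing import List, Optional


def extract_with_regex(fragments: List[str], patterns: List[str]) -> Optional[str]:
    """Return the first fragment matched by the highest-priority pattern.

    Hypothetical helper: `patterns` is ordered from highest to lowest
    priority (topmost first), `fragments` is in document order.
    """
    for pattern in patterns:              # topmost pattern is tried first
        compiled = re.compile(pattern)
        for fragment in fragments:        # first matching fragment wins
            if compiled.search(fragment):
                return fragment           # the whole fragment is the result
    return None


fragments = ["Invoice", "No.", "RE-2024-0042", "Date", "12.03.2024"]
patterns = [r"^RE-\d{4}-\d{4}$", r"^\d{2}\.\d{2}\.\d{4}$"]
result = extract_with_regex(fragments, patterns)
```

Both patterns match a fragment here, but the invoice-number pattern is listed first, so its match is returned. Reversing the pattern list would return the date instead.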
 

Extractor "key-value-pair"

Definition of a keyword and a direction (left, right, down, up) in which the value is located relative to the keyword.
The keyword (pattern) can be defined as a regular expression. If multiple patterns are configured and two or more of them match, the topmost pattern has the highest priority. A direction must be defined for each pattern.
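
As a rough illustration of the keyword-plus-direction idea, the sketch below looks up the nearest token in the configured direction. Everything in it is an assumption made for the example: the `Token` class, the use of token centres in fractional page coordinates, the alignment tolerance `tol`, and the "nearest token wins" rule. How the product actually selects the value is not specified here.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Token:
    text: str
    x: float  # horizontal centre as a page fraction (assumed layout model)
    y: float  # vertical centre as a page fraction, growing downwards (assumed)


def offset_in_direction(key: Token, tok: Token, direction: str, tol: float) -> Optional[float]:
    """Distance from key to tok along `direction`, or None if tok is not there."""
    dx, dy = tok.x - key.x, tok.y - key.y
    if direction == "right" and dx > 0 and abs(dy) <= tol:
        return dx
    if direction == "left" and dx < 0 and abs(dy) <= tol:
        return -dx
    if direction == "down" and dy > 0 and abs(dx) <= tol:
        return dy
    if direction == "up" and dy < 0 and abs(dx) <= tol:
        return -dy
    return None


def find_value(tokens: List[Token], rules: List[Tuple[str, str]], tol: float = 0.01) -> Optional[str]:
    """rules is a priority-ordered list of (keyword pattern, direction)."""
    for pattern, direction in rules:          # topmost rule has priority
        for key in tokens:
            if not re.search(pattern, key.text):
                continue
            best = None                        # (distance, token) of nearest hit
            for tok in tokens:
                if tok is key:
                    continue
                dist = offset_in_direction(key, tok, direction, tol)
                if dist is not None and (best is None or dist < best[0]):
                    best = (dist, tok)
            if best is not None:
                return best[1].text
    return None


tokens = [
    Token("Total:", 0.60, 0.80), Token("119.00", 0.75, 0.80),
    Token("Date", 0.10, 0.20), Token("12.03.2024", 0.10, 0.23),
]
rules = [(r"^Total", "right"), (r"^Date", "down")]
value = find_value(tokens, rules)
```

With these sample tokens, the first rule finds "119.00" to the right of "Total:". If only the second rule were configured, the token below "Date" would be returned.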
 

Extractor "fixed-region"

Definition of the region from which the result is extracted. The whole text in the region is the result, so in most cases a transformer is necessary to extract the correct data. The values top, left, right, and bottom define the limits of the rectangle.
 
| Parameter | Description |
| --- | --- |
| Pattern | Search pattern (regex) |
| Top | Top coordinate of the rectangle in percent (0 <= percent <= 100) |
| Left | Left coordinate of the rectangle in percent (0 <= percent <= 100) |
| Bottom | Bottom coordinate of the rectangle in percent (0 <= percent <= 100) |
| Right | Right coordinate of the rectangle in percent (0 <= percent <= 100) |
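
To make the percent coordinates concrete, the following sketch converts a fixed region into pixel coordinates for a page of known size. The function name `region_to_pixels` is hypothetical, and the sketch assumes that Top and Bottom are measured from the top edge of the page and Left and Right from the left edge, which is not stated explicitly above.

```python
from typing import Tuple


def region_to_pixels(top: float, left: float, bottom: float, right: float,
                     page_width: int, page_height: int) -> Tuple[int, int, int, int]:
    """Convert percent coordinates (0-100) to a (left, top, right, bottom) pixel box.

    Assumption: the origin is the top-left corner of the page.
    """
    for name, value in (("top", top), ("left", left), ("bottom", bottom), ("right", right)):
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be between 0 and 100, got {value}")
    if top >= bottom or left >= right:
        raise ValueError("the rectangle must have a positive width and height")
    return (
        round(left * page_width / 100),
        round(top * page_height / 100),
        round(right * page_width / 100),
        round(bottom * page_height / 100),
    )


# Example: a region in the upper right of an A4 page scanned at 300 dpi
box = region_to_pixels(top=10, left=50, bottom=20, right=90,
                       page_width=2480, page_height=3508)
```

Because all text inside the resulting box becomes the extraction result, a narrow box reduces the amount of clean-up the transformer has to do.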

Extractor "key-region"

Definition of a keyword, a distance Vector (horizontal and vertical distance), and a Region Size (textbox width and height) that together describe where the value is located relative to the keyword.
The keyword (pattern) can be defined as a regular expression. If multiple patterns are configured and two or more of them match, the topmost pattern has the highest priority. Only one Vector and one Region Size can be provided.
 
| Parameter | Description |
| --- | --- |
| Pattern | Search pattern (regex) |
| Fuzzy | [default=disabled] Enables fuzzy regex with a maximum edit distance of 2 |
| X | Top-left to top-left horizontal distance in fractional (0-1) page coordinates |
| Y | Top-left to top-left vertical distance in fractional (0-1) page coordinates |
| WIDTH | Width of the textbox in fractional (0-1) page coordinates |
| HEIGHT | Height of the textbox in fractional (0-1) page coordinates |
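
The geometry of these parameters can be sketched as follows: the region's top-left corner is the keyword's top-left corner plus the vector (X, Y), and WIDTH and HEIGHT span the box from there. The function name `key_region`, the clamping to the page, and the rounding to four decimals are assumptions made for this example.

```python
from typing import Dict


def key_region(key_left: float, key_top: float,
               x: float, y: float, width: float, height: float) -> Dict[str, float]:
    """Compute the extraction region in fractional (0-1) page coordinates.

    (key_left, key_top) is the top-left corner of the matched keyword.
    (x, y) is the top-left to top-left vector; width/height size the box.
    Clamping to the page bounds is an assumption of this sketch.
    """
    def clamp(value: float) -> float:
        return round(min(max(value, 0.0), 1.0), 4)

    left = key_left + x
    top = key_top + y
    return {
        "top": clamp(top),
        "left": clamp(left),
        "right": clamp(left + width),
        "bottom": clamp(top + height),
    }


# Keyword found at (0.10, 0.20); the value sits 0.25 page widths to its right
region = key_region(key_left=0.10, key_top=0.20, x=0.25, y=0.0, width=0.20, height=0.03)
```

For this input the region starts at left 0.35, top 0.2 and extends to right 0.55, bottom 0.23.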

Extractor "key-textblock"

Definition of a keyword and a direction (left, right, down, up) in which the value is located relative to the keyword.
The keyword (pattern) can be defined as a regular expression. If multiple patterns are configured and two or more of them match, the topmost pattern has the highest priority. A direction must be defined for each pattern.
 

Extractor "checkbox"

To extract checkboxes, a keyword is first needed to locate the group of checkboxes. The vectors to the individual checkboxes are stored relative to this keyword.
 
At the moment the checkbox extractor cannot be configured conveniently via the GUI. It can only be configured using input parameters (in JSON format).
{
  "train_data": {
    "scale": false,
    "vectors": [
      [0.1755, 0.0061], [0.1755, 0.0418], [0.1755, 0.0775], [0.1755, 0.1132]
    ],
    "key_text": "ProductA",
    "key_descr": "ProductADesc",
    "box_descrs": [
      "ProductA", "ProductB", "ProductC", "ProductD"
    ],
    "key_center": [
      0.1539, 0.1748, 0
    ],
    "box_regions": [
      { "top": 0.1665, "left": 0.298, "right": 0.3214, "bottom": 0.1822 },
      { "top": 0.2125, "left": 0.298, "right": 0.3214, "bottom": 0.2282 },
      { "top": 0.2585, "left": 0.298, "right": 0.3214, "bottom": 0.2742 },
      { "top": 0.3045, "left": 0.298, "right": 0.3214, "bottom": 0.3202 }
    ],
    "vector_mode": "bottom-left"
  }
}
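
Because this configuration is written by hand, a quick sanity check before uploading it can catch simple mistakes. The sketch below is not part of the product. It assumes, based only on the sample above, that `vectors`, `box_descrs`, and `box_regions` each hold one entry per checkbox and that regions use fractional (0-1) page coordinates. The function name `check_checkbox_config` is hypothetical.

```python
import json

CONFIG = """
{
  "train_data": {
    "scale": false,
    "vectors": [[0.1755, 0.0061], [0.1755, 0.0418], [0.1755, 0.0775], [0.1755, 0.1132]],
    "key_text": "ProductA",
    "key_descr": "ProductADesc",
    "box_descrs": ["ProductA", "ProductB", "ProductC", "ProductD"],
    "key_center": [0.1539, 0.1748, 0],
    "box_regions": [
      {"top": 0.1665, "left": 0.298, "right": 0.3214, "bottom": 0.1822},
      {"top": 0.2125, "left": 0.298, "right": 0.3214, "bottom": 0.2282},
      {"top": 0.2585, "left": 0.298, "right": 0.3214, "bottom": 0.2742},
      {"top": 0.3045, "left": 0.298, "right": 0.3214, "bottom": 0.3202}
    ],
    "vector_mode": "bottom-left"
  }
}
"""


def check_checkbox_config(raw: str) -> int:
    """Parse the JSON and return the number of checkboxes it describes.

    Raises ValueError if the per-checkbox lists disagree in length or a
    region is not a valid rectangle (assumed rules, see lead-in).
    """
    data = json.loads(raw)["train_data"]
    count = len(data["box_descrs"])
    if not (len(data["vectors"]) == len(data["box_regions"]) == count):
        raise ValueError("vectors, box_descrs and box_regions differ in length")
    for region in data["box_regions"]:
        if not (0 <= region["top"] < region["bottom"] <= 1
                and 0 <= region["left"] < region["right"] <= 1):
            raise ValueError(f"invalid region: {region}")
    return count


checkbox_count = check_checkbox_config(CONFIG)
```

The check is purely structural. It does not verify that the vectors actually point at the listed regions, since the exact relationship between `key_center`, `vectors`, and `vector_mode` is not documented here.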

ML-Extractor "gnn-address-extractor"

This is a machine learning-based extractor. It predicts the classes of objects (rectangular regions), passes the regions through an address parser, and returns the candidates constructed from them. The predictions are the address objects/dictionaries within the predicted regions.
The address extractor should only be used in field sets, because the region of the address is recognized at once and the individual fields are then extracted from the whole address using various rules.
In order to use an address field set, the mandatory fields must be present, as they span the rectangle of the address. The company name or the contact name is usually in the first line of an address, and the city and the postal code are in the last line.
 

ML-Extractor "gnn-line-item-extractor"

This is a machine learning-based extractor that can be used without any parameters. It predicts the classes of objects (rectangular regions) using multiclass object classification and returns the candidate field set objects/dictionaries constructed from them.
This extractor is used when multiple fields depend on each other, e.g. to recognize line item data.
 

ML-Extractor "gnn-fieldset-extractor"

This is a machine learning-based extractor that can be used without any parameters. It predicts the classes of objects (rectangular regions) using multiclass object classification and returns the candidate field set objects/dictionaries constructed from them.
This extractor is used when multiple fields depend on each other, e.g. to recognize line item data.
 

ML-Extractor "gnn-field-extractor"

This is a machine learning-based extractor that can be used without any parameters. It predicts single objects, passes the regions through a region-based text filter, and returns the candidates constructed from them. The predictions are the texts within the predicted regions.