Learn how to use Extractors

Learn about the different extraction tools and services that are available when configuring a field or field set, and how to set them up.

Extractors are the main building block of the extraction pipeline and do most of the heavy lifting. They let the user define how a certain type of information is extracted: either through machine learning (which requires examples to train a model on) or through rule-based approaches (which need no examples and work right away). Which approach is better suited depends on the use case.

Template/Rule Based Extractors for fields

Name of extractor | Description
regex | Searches for a string that matches the defined regular expression. If more than one string matches the regex, the system takes the first one.
barcodes | Filters input barcodes by regex pattern, type and page number.
fixed_region | Extracts a value from a fixed region (depending only on the coordinates). In most cases a verifier or a transformer is needed.
key-value-pair | Definition of a keyword and a direction (left, right, down, up) in which the value is located relative to the keyword.
currency | Finds the most common currency on the document.
extract-nothing | A placeholder, because every field needs an extractor. Nothing is read in the process. The field exists exclusively for manual validation.
key-textblock | Extracts a continuous text block relative to a key token.

Machine Learning (ML) Based Extractors for field sets

Name of extractor | Description
gnn-address-extractor | Extraction of an address block
repeatable-fieldset-detector | Extraction of an n-tuple of repeatable values, such as line items, banking details or payment conditions
gnn-fieldset-extractor | Extraction of non-repeatable field sets

Machine Learning (ML) Based Extractors for fields

Name of extractor | Description
Gnn field extractor | Extraction of single values and regions, with or without a keyword
gnn-address-extractor | Extraction of single-field addresses

Fuzzy option

If the fuzzy option is enabled, values that are similar to the search pattern are also accepted. The tolerance depends on the length of the word and the similarity of the characters.
Search pattern | Value | Matches
[A-Z]{3} | ABC | yes
[A-Z]{3} | AB5 | no
[A-Z]{5} | ABCDE | yes
[A-Z]{5} | ABCD5 | yes
[A-Z]{5} | AB6D5 | no
[A-Z]{10} | ABCDEFGHIJ | yes
[A-Z]{10} | ABC65FGHIJ | yes
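The examples in the table are consistent with a tolerance that grows with the length of the pattern. The following is a minimal sketch of that idea, not the platform's actual algorithm. It assumes that one mismatched character is tolerated per five characters, capped at two, and it only handles fixed-length patterns of the form [class]{n}. The helper name fuzzy_matches is hypothetical.

```python
import re

def fuzzy_matches(char_class: str, length: int, value: str) -> bool:
    """Check value against char_class{length} with a length-dependent tolerance."""
    if len(value) != length:
        return False  # this sketch only considers character substitutions
    # Assumed rule (not confirmed by the platform): one tolerated mismatch
    # per 5 characters, at most 2 in total.
    tolerance = min(length // 5, 2)
    mismatches = sum(1 for ch in value if not re.fullmatch(char_class, ch))
    return mismatches <= tolerance

# Rows of the table above: (character class, length, value)
rows = [
    ("[A-Z]", 3, "ABC"), ("[A-Z]", 3, "AB5"),
    ("[A-Z]", 5, "ABCDE"), ("[A-Z]", 5, "ABCD5"), ("[A-Z]", 5, "AB6D5"),
    ("[A-Z]", 10, "ABCDEFGHIJ"), ("[A-Z]", 10, "ABC65FGHIJ"),
]
for char_class, length, value in rows:
    print(value, "yes" if fuzzy_matches(char_class, length, value) else "no")
```

Under these assumptions the sketch reproduces the yes/no column of the table. Other rules could fit the same examples, so the thresholds should be treated as illustrative only.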

Output data types

The output data type defines how the result of the extractor is interpreted and how the field is displayed in the validation mask of the Parashift Platform.
Output data type | Description
Allowed values | Definition of a list of allowed values, e.g. for the gender of a person (male / female / non-binary). If the extracted value is not in the list of allowed values, it is not recognized and not inserted.
Boolean | A checkbox that indicates whether a value was found.
Date | A date, which can be restricted by defining a minimum or maximum date.
Datetime | A date including hour and minute.
Float | Decimal value with up to two decimals
Integer | Integer value, without decimals
String | Text
Multi-Checkbox | Checkbox group, see Extractor "checkbox"
Page coordinates | Stores the coordinates of a value rather than the value itself.

Output data type - "Allowed values"

Each allowed value consists of an internal value and a display value. The internal value is stored in the background and is also the value returned by the API calls of the platform. The display value is only used in the front end.
Note that the value is case-sensitive: if "Male" with an uppercase M is recognized, the allowed value is not set. A transformer is mandatory in this case.
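The case-sensitive lookup and the role of the transformer can be sketched as follows. This is an illustration only: the configuration dictionary, the function names and the assumption that the lookup runs against the internal value are all hypothetical and not taken from the platform.

```python
from typing import Optional

# Hypothetical allowed-values configuration: internal value -> display value.
ALLOWED_VALUES = {"male": "Male", "female": "Female", "non-binary": "Non-binary"}

def resolve_allowed_value(extracted: str) -> Optional[str]:
    """Return the internal value if it is allowed, else None (case-sensitive)."""
    return extracted if extracted in ALLOWED_VALUES else None

def lowercase_transformer(extracted: str) -> str:
    """Stand-in for a transformer that normalizes the value before the lookup."""
    return extracted.strip().lower()

raw = "Male"
print(resolve_allowed_value(raw))                         # not set without a transformer
print(resolve_allowed_value(lowercase_transformer(raw)))  # set after normalization
```

Without the normalization step the value "Male" is rejected, which is why a transformer is needed whenever the document text and the internal values differ in case.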
"Allowed values" in the configuration of a field:
"Allowed values" in the validation of a field:

Output data type "Boolean"

The output type Boolean shows a checkbox. The checkbox is selected if something was recognized for the field. This is independent of the value itself: the values "true" or "yes" select the checkbox, but so do "no", "false", and so on.

Output data type "Date"

The date type can be restricted, if a minimum or maximum date is defined.
"Date" in the configuration of a field:
"Date" in the validation of a field
A minimum or maximum date has no effect on the extraction of the value. However, if the extracted or validated value is out of range, an error occurs.
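The range check on a date can be sketched as a separate validation step that runs after extraction. The function name check_date_range and the inclusive bounds are assumptions made for this illustration.

```python
from datetime import date
from typing import Optional

def check_date_range(value: date,
                     minimum: Optional[date] = None,
                     maximum: Optional[date] = None) -> Optional[str]:
    """Return an error message if value is out of range, else None.

    The bounds are treated as inclusive (an assumption of this sketch).
    """
    if minimum is not None and value < minimum:
        return f"{value.isoformat()} is before the minimum date {minimum.isoformat()}"
    if maximum is not None and value > maximum:
        return f"{value.isoformat()} is after the maximum date {maximum.isoformat()}"
    return None

# The value is extracted regardless of the limits; only the check reports an error.
extracted = date(2019, 12, 31)
print(check_date_range(extracted, minimum=date(2020, 1, 1)))
```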

Output data type "Datetime"

Like the data type Date, the data type Datetime can be restricted by a minimum and a maximum date.
In the validation mask, an hour and a minute component are also available.

Output data type "Float"

The float type is a number with two decimals. The value can also be restricted by a minimum and/or maximum value.
"Float" in the configuration of a field:
"Float" in the validation of a field:
A minimum or maximum value has no effect on the extraction of the value. However, if the extracted or validated value is out of range, an error occurs.

Output data type "Page coordinates"

Only the page coordinates are extracted, and the value is shown as an image.
This means that only the coordinates of the value are available and stored, not its content.

Extractor "extract-nothing"

It is a placeholder for an extractor, because a field needs an extractor. Nothing is read in the process. The field exists exclusively for pure manual validation.

Extractor "barcodes"

This extractor can be used to filter input barcodes by regex pattern, type and page number. Supported barcode types: Code128, Code39, datamatrix, EAN-13, EAN-8, QRCODE.
Parameter | Description
Page number | Restricts the extraction to a page (1 = first page)
Pattern | Search pattern (regex)
Type | Barcode type: Code128, Code39, datamatrix, EAN-13, EAN-8, QRCODE
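The three filter parameters can be pictured as successive conditions on the list of barcodes found on a document. The following is a minimal sketch with a hypothetical Barcode record and filter_barcodes helper. It does not reflect how the platform represents barcodes internally.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Barcode:
    value: str  # decoded content
    type: str   # e.g. "Code128", "EAN-13", "QRCODE"
    page: int   # 1 = first page

def filter_barcodes(barcodes: List[Barcode],
                    pattern: Optional[str] = None,
                    barcode_type: Optional[str] = None,
                    page: Optional[int] = None) -> List[Barcode]:
    """Keep only the barcodes that satisfy every filter that is set."""
    result = []
    for barcode in barcodes:
        if page is not None and barcode.page != page:
            continue
        if barcode_type is not None and barcode.type != barcode_type:
            continue
        if pattern is not None and not re.search(pattern, barcode.value):
            continue
        result.append(barcode)
    return result

found = [
    Barcode("4006381333931", "EAN-13", 1),
    Barcode("INV-20417", "Code128", 1),
    Barcode("INV-20418", "Code128", 2),
]
print(filter_barcodes(found, pattern=r"^INV-\d+$", barcode_type="Code128", page=1))
```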

 

Extractor "regex"

The regex extractor is used to extract values with a clear structure, e.g. an identification number or a document date.
The extractor searches for the fragment that matches the pattern, and the whole content of that fragment is used as the result of the extraction.
Because of this, it may be necessary to use a transformer with the same regex pattern to isolate the value.
It is also possible to define more than one regex pattern. If multiple results are found, the topmost pattern has the highest priority.
Parameter | Description
Page number | Restricts the extraction to a page (1 = first page)
Pattern | Search pattern (regex)
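The interplay of pattern priority, whole-fragment results and a follow-up transformer can be sketched as follows. The sketch assumes that a fragment is a piece of OCR text and that fragments arrive in reading order. Both function names are hypothetical.

```python
import re
from typing import List, Optional

def extract_with_regex(fragments: List[str], patterns: List[str]) -> Optional[str]:
    """Return the first fragment that matches the highest-priority pattern."""
    for pattern in patterns:            # patterns listed first have priority
        for fragment in fragments:      # fragments in reading order
            if re.search(pattern, fragment):
                return fragment         # the whole fragment is the result
    return None

def regex_transformer(value: str, pattern: str) -> Optional[str]:
    """Stand-in for a transformer that keeps only the matching part."""
    match = re.search(pattern, value)
    return match.group(0) if match else None

fragments = ["Invoice", "No.:INV-20417", "Date:", "2023-05-04"]
pattern = r"INV-\d{5}"
raw = extract_with_regex(fragments, [pattern])
print(raw)                              # whole fragment, including "No.:"
print(regex_transformer(raw, pattern))  # only the invoice number
```

The second step shows why the same pattern is often reused in a transformer: the extractor returns the surrounding fragment, and the transformer trims it down to the value.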

 

Extractor "key-value-pair"

Definition of a keyword and a direction (left, right, down, up) in which the value is located relative to the keyword.
A regular expression can be defined as the keyword (pattern). If multiple patterns are configured and two or more of them match, the topmost pattern has the highest priority. A direction needs to be defined for each pattern.
Parameter | Description
Page number | Restricts the extraction to a page (1 = first page)
Pattern | Search pattern (regex)
Fuzzy | [default=disabled] Enables fuzzy regex with a max edit distance of 2
Direction | Direction (left, right, down, up) in which the value can be found
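The keyword-plus-direction lookup can be sketched on a list of OCR tokens with center coordinates. Everything here is an assumption made for illustration: the Token record, the fractional coordinates with y growing downward, the alignment tolerance, and the choice of the nearest token as the value.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Token:
    text: str
    x: float  # horizontal center, fractional page coordinate (0-1)
    y: float  # vertical center, fractional page coordinate (0-1), grows downward

def find_value(tokens: List[Token], pattern: str, direction: str,
               tolerance: float = 0.01) -> Optional[str]:
    """Return the text of the nearest token in `direction` from the keyword."""
    if direction not in ("left", "right", "up", "down"):
        raise ValueError(f"unknown direction: {direction}")
    key = next((t for t in tokens if re.search(pattern, t.text)), None)
    if key is None:
        return None

    def in_direction(t: Token) -> bool:
        dx, dy = t.x - key.x, t.y - key.y
        if direction in ("left", "right"):
            same_line = abs(dy) <= tolerance
            return same_line and (dx > 0 if direction == "right" else dx < 0)
        same_column = abs(dx) <= tolerance
        return same_column and (dy > 0 if direction == "down" else dy < 0)

    candidates = [t for t in tokens if in_direction(t)]
    if not candidates:
        return None
    nearest = min(candidates, key=lambda t: abs(t.x - key.x) + abs(t.y - key.y))
    return nearest.text

tokens = [Token("Invoice-No", 0.1, 0.2), Token("4711", 0.3, 0.2), Token("Total", 0.1, 0.5)]
print(find_value(tokens, r"Invoice-No", "right"))
```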

 

Extractor "fixed-region"

Definition of the region from which the result is extracted. The whole text in the region is the result, so in most cases a transformer is necessary to extract the correct data. The values top, left, right and bottom define the limits of the rectangle (in percent, 0 to 100).
Parameter | Description
Page number | Restricts the extraction to a page (1 = first page)
Pattern | Search pattern (regex)
Top | Top coordinate of the rectangle in percent (0 to 100)
Left | Left coordinate of the rectangle in percent (0 to 100)
Bottom | Bottom coordinate of the rectangle in percent (0 to 100)
Right | Right coordinate of the rectangle in percent (0 to 100)
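A fixed region can be thought of as a rectangle filter over the text on a page. The sketch below assumes tokens given as (text, x, y) tuples with fractional center coordinates and an origin in the top-left corner of the page. The helper text_in_region is hypothetical.

```python
from typing import List, Tuple

def text_in_region(tokens: List[Tuple[str, float, float]],
                   top: float, left: float, bottom: float, right: float) -> str:
    """Join the text of all tokens whose center lies inside the rectangle.

    The rectangle limits are given in percent (0 to 100), the token centers
    as fractions (0-1) of the page width and height.
    """
    t, l, b, r = (v / 100.0 for v in (top, left, bottom, right))
    inside = [text for text, x, y in tokens if l <= x <= r and t <= y <= b]
    return " ".join(inside)

tokens = [("Invoice", 0.7, 0.05), ("2023-05-04", 0.8, 0.05), ("Footer", 0.5, 0.95)]
# Top-right corner of the page: top 0 %, left 60 %, bottom 10 %, right 100 %.
print(text_in_region(tokens, top=0, left=60, bottom=10, right=100))
```

Because the result contains every token in the rectangle, a transformer would typically follow to isolate the date in this example.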

 

Extractor "key-region"

Definition of a keyword, a distance vector (horizontal and vertical distance) and a region size (text box width and height) that locate the value relative to the keyword.
A regular expression can be defined as the keyword (pattern). If multiple patterns are configured and two or more of them match, the topmost pattern has the highest priority. Only one vector and one region size can be provided.
Parameter | Description
Page number | Restricts the extraction to a page (1 = first page)
Pattern | Search pattern (regex)
Fuzzy | [default=disabled] Enables fuzzy regex with a max edit distance of 2
X | Top-left to top-left horizontal distance in fractional (0-1) page coordinates
Y | Top-left to top-left vertical distance in fractional (0-1) page coordinates
WIDTH | Width of the text box in fractional (0-1) page coordinates
HEIGHT | Height of the text box in fractional (0-1) page coordinates
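The region that results from these parameters can be computed with simple arithmetic: the top-left corner of the text box is the top-left corner of the keyword shifted by (X, Y), and the box extends by WIDTH and HEIGHT from there. A minimal sketch, assuming an origin in the top-left corner of the page; the function name and the returned dictionary layout are hypothetical.

```python
from typing import Dict

def key_region(key_left: float, key_top: float,
               x: float, y: float, width: float, height: float) -> Dict[str, float]:
    """Compute the text box relative to the keyword's top-left corner.

    All values are fractional (0-1) page coordinates.
    """
    left = key_left + x
    top = key_top + y
    return {"top": top, "left": left, "bottom": top + height, "right": left + width}

# Keyword at (0.25, 0.5); the value sits a quarter page to the right on the same line.
print(key_region(key_left=0.25, key_top=0.5, x=0.25, y=0.0, width=0.125, height=0.0625))
```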

 

Extractor "key-textblock"

Definition of a keyword, a direction (left, right, down, up) in which the value is located relative to the keyword, and neighbour/clustering parameters.
A regular expression can be defined as the keyword (pattern). If multiple patterns are configured and two or more of them match, the topmost pattern has the highest priority. A direction needs to be defined for each pattern.
Parameter | Description
Page number | Restricts the extraction to a page (1 = first page)
Pattern | Search pattern (regex)
Fuzzy | [default=disabled] Enables fuzzy regex with a max edit distance of 2
Direction | Direction (left, right, down, up) in which the value can be found
EPS | "The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function." (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html)
MIN_SAMPLES | "The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself." (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html)

 

Extractor "esr"

Extraction of the ESR inpayment slip (BVR/PVR/ISR).
Parameter | Description
Page number | Restricts the extraction to a page (1 = first page)
For this extractor, the "extractor attribute" of each field needs to be defined.
Title | Extractor attribute
ESR reference (3) | reference
ESR account (4) | account
ESR amount (2) | amount
ESR currency (1) | currency
ESR code (5) | code

 

Extractor "qr-bill"

Extraction of the QR code with payment information.
Parameter | Description
Page number | Restricts the extraction to a page (1 = first page)
For this extractor, the "extractor attribute" of each field needs to be defined.
Title | Extractor attribute
Account | account
Amount | amount
Currency | currency
Company name (payable to) | cr_name
Street (payable to) | cr_street
House number (payable to) | cr_house_number
Postal code (payable to) | cr_postal_code
City (payable to) | cr_city
Country (payable to) | cr_country
Company name (payable by) | ud_name
Street (payable by) | ud_street
House number (payable by) | ud_house_number
Postal code (payable by) | ud_postal_code
City (payable by) | ud_city
Country (payable by) | ud_country
Reference type | ref_type
Reference | ref
Message | message
Billing information | billing_info

 

Extractor "tax-ids"

Extraction of the tax IDs of the sender and the receiver.
Parameter | Description
Page number | Restricts the extraction to a page (1 = first page)
For this extractor, too, the "extractor attribute" of each field needs to be defined.
Title | Extractor attribute
Tax ID (sender) | sender_tax_id
Tax ID (receiver) | receiver_tax_id

 

Extractor "banking-details"

Extraction of banking accounts.
Parameter | Description
Page number | Restricts the extraction to a page (1 = first page)
For this extractor, too, the "extractor attribute" of each field needs to be defined.
Name | Extractor attribute
Bank name | bank_name
IBAN | iban
BIC | bic
Street | street
House number | house_number
Postal code | postal_code
City | city_name

Extractor "checkbox"

To extract checkboxes, a keyword that identifies the group of checkboxes is needed first. The vectors to the individual checkboxes are stored relative to this keyword.
Parameter | Description
Page number | Restricts the extraction to a page (1 = first page)
Label | Label of the i-th element
Identifier | Internal identifier of the i-th element
At the moment, the checkbox extractor cannot be configured conveniently via the GUI, only via input parameters (in JSON format):
{
  "train_data": {
    "scale": false,
    "vectors": [
      [0.1755, 0.0061], [0.1755, 0.0418], [0.1755, 0.0775], [0.1755, 0.1132]
    ],
    "key_text": "ProductA",
    "key_descr": "ProductADesc",
    "box_descrs": [
      "ProductA", "ProductB", "ProductC", "ProductD"
    ],
    "key_center": [
      0.1539, 0.1748, 0
    ],
    "box_regions": [
      { "top": 0.1665, "left": 0.298, "right": 0.3214, "bottom": 0.1822 },
      { "top": 0.2125, "left": 0.298, "right": 0.3214, "bottom": 0.2282 },
      { "top": 0.2585, "left": 0.298, "right": 0.3214, "bottom": 0.2742 },
      { "top": 0.3045, "left": 0.298, "right": 0.3214, "bottom": 0.3202 }
    ],
    "vector_mode": "bottom-left"
  }
}
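Before submitting such a configuration, it can help to check that the lists line up. The following sketch loads a shortened copy of the JSON above and pairs each label in box_descrs with the region at the same index in box_regions. Pairing by index is an assumption based on the "i-th element" wording of the parameters, and the helper boxes_by_label is hypothetical.

```python
import json

# Shortened copy of the configuration above (two of the four checkboxes).
config = json.loads("""
{
  "train_data": {
    "key_text": "ProductA",
    "box_descrs": ["ProductA", "ProductB"],
    "box_regions": [
      {"top": 0.1665, "left": 0.298, "right": 0.3214, "bottom": 0.1822},
      {"top": 0.2125, "left": 0.298, "right": 0.3214, "bottom": 0.2282}
    ]
  }
}
""")

def boxes_by_label(cfg: dict) -> dict:
    """Map each checkbox label to its region, assuming the lists are index-aligned."""
    data = cfg["train_data"]
    if len(data["box_descrs"]) != len(data["box_regions"]):
        raise ValueError("box_descrs and box_regions must have the same length")
    return dict(zip(data["box_descrs"], data["box_regions"]))

for label, region in boxes_by_label(config).items():
    print(label, region)
```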

ML-Extractor "address-detector" -> "object-detector-libpostal-multistep"

This is a machine-learning-based extractor that can be used without parameters. It predicts the classes of objects (rectangular regions), passes the regions through an address parser and returns the candidates constructed from them, where the predictions are the address objects/dictionaries within the predicted regions.
The address-detector should only be used in field sets, because the region of the address is recognized as a whole and the individual fields are then extracted from the whole address using various rules. The extractor attribute of these fields has to be defined according to the attribute table below.
Parameter | Description
Page number | Restricts the extraction to a page (1 = first page)
Threshold | 0.02
To use an address field set, the mandatory fields must be present, because they span the rectangle of the address. The company name or the contact name is usually in the first line of an address, and the city and the postal code are in the last line.
Name | Extractor attribute | Mandatory
Company name | company_name | yes
Title | salutation |
Contact name | name_2 | yes
Street | street |
House number | house_number |
Address suffix | street_2 |
Postal code | postal_code | yes
City | city_name | yes
Country | country_code |
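A quick way to verify an address field set before training is to compare its extractor attributes with the mandatory ones from the table. This is a minimal sketch with a hypothetical helper. It only checks names and says nothing about how the platform validates a field set.

```python
from typing import Iterable, List

# Attributes marked as mandatory in the table above.
MANDATORY_ADDRESS_ATTRIBUTES = {"company_name", "name_2", "postal_code", "city_name"}

def missing_mandatory_attributes(configured: Iterable[str]) -> List[str]:
    """Return the mandatory extractor attributes that are not configured."""
    return sorted(MANDATORY_ADDRESS_ATTRIBUTES - set(configured))

field_set = ["company_name", "street", "house_number", "postal_code", "city_name"]
print(missing_mandatory_attributes(field_set))  # attributes still to be added
```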
 

 

ML-Extractor "repeatable-fieldset-detector"

This is a machine-learning-based extractor that can be used without parameters. It predicts the classes of objects (rectangular regions) using multiclass object classification and returns the candidate field set objects/dictionaries constructed from them.
This extractor is used when multiple fields depend on each other, e.g. to recognize line item data. In this case, a structure is learned in the background, for example: the quantity is a float value, located to the left of the total amount, under the column heading "Quantity shipped".
Parameter | Description
Page number | Restricts the extraction to a page (1 = first page)
Threshold | 0.02

 

ML-Extractor "object-detector"

This is a machine-learning-based extractor that can be used without parameters. It predicts the classes of objects (rectangular regions) and returns the candidates constructed from them.
Parameter | Description
Page number | Restricts the extraction to a page (1 = first page)
Threshold | 0.02

 

ML-Extractor "object-detector-fixed-region"

This is a machine-learning-based extractor that can be used without parameters. It predicts the classes of objects (rectangular regions), passes the regions through a region-based text filter and returns the candidates constructed from them, where the predictions are the texts within the predicted regions.
Parameter | Description
Page number | Restricts the extraction to a page (1 = first page)
Threshold | 0.02

 

ML-Extractor "ml-key-value-pair"

This is a machine-learning-based extractor that can be used without parameters. It predicts values based on trained keywords.
An example is the extraction of the invoice number. Depending on the invoicing party and the ERP system used, many different keywords and key-value relations occur, e.g. below the keyword "Invoice-No" or to the right of "Rechnungsnummer".
The extractor learns the dictionary of possible keywords and relationships.
Parameter | Description
Page number | Restricts the extraction to a page (1 = first page)
Threshold | 0.02

 

ML-Extractor "document-language"

This is a machine-learning-based language detector that uses a library to detect the language from the OCR text of a document.
Parameter | Description
Page number | Restricts the extraction to a page (1 = first page)