Learn about the extraction tools and services that are available when configuring a field or field set, and how to set them up.
Extractors are the main building block of the extraction pipeline and do most of the heavy lifting. They let the user define how a certain type of information is extracted: either through machine learning (which requires examples to train a model on) or through rule-based approaches (which need no examples and work right away). Which approach is better suited depends on the use case.
Template/Rule Based Extractors for fields
Name of extractor | Description
regex | Searches for a string that matches the defined regular expression. If more than one string matches the regex, the system takes the first one.
barcodes | Filters input barcodes by regex pattern, type and page number.
fixed_region | Extracts a value at a fixed region (depending only on the coordinates). In most cases a verifier or a transformer is needed.
key-value-pair | Definition of a keyword and a direction (left, right, down, up) in which the value is placed relative to the keyword.
currency | Finds the most common currency on the document.
extract-nothing | A placeholder, because every field needs an extractor. Nothing is read in the process. The field exists exclusively for manual validation.
key-textblock | Extracts a continuous text block relative to a key token.
Machine Learning (ML) Based Extractors for field sets
Name of extractor | Description
gnn-address-extractor | Extraction of an address block
repeatable-fieldset-detector | Extraction of an n-tuple of repeatable values, like line items, banking details or payment conditions
gnn-fieldset-extractor | Extraction of non-repeatable fieldsets
Machine Learning (ML) Based Extractors for fields
Name of extractor | Description
Gnn field extractor | Extraction of single values and regions, with or without a keyword
gnn-address-extractor | Extraction of single-field addresses
Fuzzy option
If the fuzzy option is enabled, values that are similar to the search pattern are also accepted. The tolerance depends on the length of the word and the similarity of the characters.
Search-Pattern | Value | Matches
[A-Z]{3} | ABC | yes
[A-Z]{3} | AB5 | no
[A-Z]{5} | ABCDE | yes
[A-Z]{5} | ABCD5 | yes
[A-Z]{5} | AB6D5 | no
[A-Z]{10} | ABCDEFGHIJ | yes
[A-Z]{10} | ABC65FGHIJ | yes
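The rows above are consistent with a tolerance that grows with the length of the pattern. The following Python sketch reproduces them under an assumed rule: one deviating character is tolerated per five pattern characters, capped at two. The function name `fuzzy_fixed_length_match` and the rule itself are illustrative assumptions chosen to fit the table, not the platform's actual algorithm.

```python
import re


def fuzzy_fixed_length_match(char_class: str, length: int, value: str) -> bool:
    """Check `value` against `char_class{length}`, tolerating a few wrong characters.

    Assumption (hypothetical rule): one deviating character is tolerated per
    five pattern characters, with a cap of two.
    """
    if len(value) != length:
        return False
    allowed_errors = min(2, length // 5)
    # Count the characters that do not belong to the character class.
    errors = sum(1 for ch in value if not re.fullmatch(char_class, ch))
    return errors <= allowed_errors


# The seven rows of the table: (length, value)
rows = [(3, "ABC"), (3, "AB5"), (5, "ABCDE"), (5, "ABCD5"),
        (5, "AB6D5"), (10, "ABCDEFGHIJ"), (10, "ABC65FGHIJ")]
for length, value in rows:
    matched = fuzzy_fixed_length_match("[A-Z]", length, value)
    print(f"[A-Z]{{{length}}}  {value:<10}  {'yes' if matched else 'no'}")
```

A short pattern therefore has to match exactly, while longer patterns tolerate typical OCR confusions such as a digit in place of a letter.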
Output data types
The output data type describes how the result of the extractor is interpreted and how the field is displayed in the validation mask of the Parashift Platform.
Output data type | Description
Allowed values | Definition of a list of allowed values, e.g. for the gender of a person (male / female / non-binary). If the extracted value is not in the list of allowed values, it is not recognized and not inserted.
Boolean | A checkbox that signals whether a value was found.
Date | A date, which can be restricted by defining a minimum or maximum date.
Datetime | Date including hour and minute.
Float | Decimal value with up to two decimals
Integer | Integer value, without decimals
String | Text
Multi-Checkbox | Checkbox group, see extractor "checkbox"
Page coordinates | Stores the coordinates of a value, not the value itself.
Output data type - "Allowed values"
Each allowed value consists of an internal value and a display value. The internal value is stored in the background and is what the API calls of the platform return. The display value is only used in the front end.
Note that the value is case sensitive: if "Male" with an upper-case M is recognized, the allowed value is not set. A transformer is mandatory in this case.
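The effect of case sensitivity can be illustrated with a small Python sketch. The mapping `ALLOWED_VALUES`, the function `resolve_allowed_value` and the lower-casing step that stands in for a transformer are all hypothetical; they only mirror the behaviour described above.

```python
# Hypothetical allowed values: internal value -> display value
ALLOWED_VALUES = {
    "male": "Male",
    "female": "Female",
    "non-binary": "Non-binary",
}


def resolve_allowed_value(extracted: str, normalise: bool = False):
    """Return the internal value if the extracted text is allowed, else None.

    `normalise=True` stands in for a transformer that trims and lower-cases
    the extracted text before the comparison (an assumption for this sketch).
    """
    candidate = extracted.strip().lower() if normalise else extracted
    # The comparison is case sensitive and runs against the internal values.
    return candidate if candidate in ALLOWED_VALUES else None


print(resolve_allowed_value("Male"))                  # not set without a transformer
print(resolve_allowed_value("Male", normalise=True))  # internal value "male"
```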
"Allowed values" in the configuration of a field:
"Allowed values" in the validation of a field:
Output data type "Boolean"
The output type boolean shows a checkbox. The checkbox is selected if a value is available, i.e. if something was recognized for the field. It is independent of the value itself: the values "true" and "yes" enable the checkbox, but so do "no", "false", and so on.
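A minimal Python sketch of this value-independent behaviour follows. The function name `checkbox_is_selected` is hypothetical, and treating an empty or whitespace-only string as "nothing recognized" is an assumption.

```python
from typing import Optional


def checkbox_is_selected(extracted_value: Optional[str]) -> bool:
    """Select the checkbox whenever something was recognized for the field.

    The content of the value is deliberately ignored. Assumption: an empty
    or whitespace-only string counts as "nothing recognized".
    """
    return extracted_value is not None and extracted_value.strip() != ""


for value in ("yes", "true", "no", "false", None):
    print(repr(value), checkbox_is_selected(value))
```

If the checkbox should reflect the meaning of the text ("no" leaves it empty), the extracted value has to be handled before it reaches the field, for example by an extractor pattern that only matches the positive case.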
Output data type "Date"
The date type can be restricted, if a minimum or maximum date is defined.
"Date" in the configuration of a field:
"Date" in the validation of a field
If a minimum or maximum date is defined, this has no effect on the extraction of the value. However, if the extracted or validated value is out of range, an error occurs.
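The separation between extraction and range checking could look like the following Python sketch. The function `date_range_errors` and its error messages are hypothetical; the sketch only shows that the value is kept as extracted and that the limits produce errors afterwards.

```python
from datetime import date
from typing import List, Optional


def date_range_errors(value: date,
                      minimum: Optional[date] = None,
                      maximum: Optional[date] = None) -> List[str]:
    """Return validation errors for a date; the value itself is never changed."""
    errors = []
    if minimum is not None and value < minimum:
        errors.append(f"{value.isoformat()} is before the minimum date {minimum.isoformat()}")
    if maximum is not None and value > maximum:
        errors.append(f"{value.isoformat()} is after the maximum date {maximum.isoformat()}")
    return errors


extracted = date(2023, 12, 31)  # extraction result, unaffected by the limits
print(date_range_errors(extracted, minimum=date(2024, 1, 1)))
print(date_range_errors(date(2024, 3, 12), minimum=date(2024, 1, 1)))
```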
Output data type "Datetime"
Like the data type date, the data type datetime can be restricted with a minimum and a maximum date.
In the validation mask, an additional hour and minute component is available.
Output data type "Float"
The float type is a number with two decimals. The value also can be restricted, using a minimum and/or maximum value.
"Float" in the configuration of a field:
"Float" in the validation of a field:
If a minimum or maximum value is defined, this has no effect on the extraction of the value. However, if the extracted or validated value is out of range, an error occurs.
Output data type "Page coordinates"
Only the page coordinates are extracted. The value is shown as an image.
This means that only the coordinates of the value are available and stored, not its content.
Extractor "extract-nothing"
It is a placeholder for an extractor, because a field needs an extractor. Nothing is read in the process. The field exists exclusively for pure manual validation.
Extractor "barcodes"
This extractor can be used to filter input barcodes by regex pattern, type and page number. Supported barcode types: Code128, Code39, datamatrix, EAN-13, EAN-8, QRCODE.
Parameter | Description
Page number | Restriction of the page number (1 = first page)
Pattern | Search pattern (regex)
Type | Barcode type: Code128, Code39, datamatrix, EAN-13, EAN-8, QRCODE
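How the three parameters narrow down the barcodes found on a document can be sketched in Python as follows. The dictionary layout of a barcode (`type`, `page`, `value`) and the function `filter_barcodes` are assumptions made for this illustration, as is the use of `re.search` for the pattern.

```python
import re


def filter_barcodes(barcodes, pattern=None, barcode_type=None, page_number=None):
    """Keep only the barcodes that satisfy every configured parameter."""
    result = []
    for barcode in barcodes:
        if page_number is not None and barcode["page"] != page_number:
            continue  # wrong page (1 = first page)
        if barcode_type is not None and barcode["type"] != barcode_type:
            continue  # wrong barcode type
        if pattern is not None and not re.search(pattern, barcode["value"]):
            continue  # content does not match the search pattern
        result.append(barcode)
    return result


# Hypothetical barcodes read from a two-page document
barcodes = [
    {"type": "QRCODE", "page": 1, "value": "INV-2024-0012"},
    {"type": "EAN-13", "page": 1, "value": "4006381333931"},
    {"type": "Code128", "page": 2, "value": "INV-2024-0013"},
]

print(filter_barcodes(barcodes, pattern=r"^INV-\d{4}-\d{4}$", page_number=1))
```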
Extractor "regex"
The regex extractor is used to extract a value that has a clear structure, e.g. an identification number or a document date.
The regex extractor searches for the matching fragment, and the whole content of the fragment is used as the result of the extraction.
Because of this, it may be necessary to use a transformer with the same regex pattern.
It is also possible to define more than one regex pattern. If multiple results are found, the topmost regex pattern has the highest priority.
Parameter | Description
Page number | Restriction of the page number (1 = first page)
Pattern | Search pattern (regex)
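The priority rule for multiple patterns can be sketched in Python like this. The function `extract_with_patterns` and the sample text are hypothetical; the sketch assumes that the patterns are tried from top to bottom and that, within one pattern, the first match in the text is taken.

```python
import re


def extract_with_patterns(text, patterns):
    """Return the first match of the highest-priority pattern, or None."""
    for pattern in patterns:  # the topmost pattern has the highest priority
        match = re.search(pattern, text)  # first match within the text
        if match:
            return match.group(0)
    return None


text = "Invoice no. RE-10023 dated 12.03.2024, due 11.04.2024"

# Both patterns match, so the topmost one wins.
print(extract_with_patterns(text, [r"RE-\d+", r"\d{2}\.\d{2}\.\d{4}"]))
# Two dates match the same pattern, so the first one is taken.
print(extract_with_patterns(text, [r"\d{2}\.\d{2}\.\d{4}"]))
```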
Extractor "key-value-pair"
Definition of a keyword and a direction (left, right, down, up) in which the value is placed relative to the keyword.
A regular expression can be defined as the keyword (pattern). If multiple patterns are configured and two or more of them match, the topmost pattern has the highest priority. A direction needs to be defined for each pattern.
Parameter | Description
Page number | Restriction of the page number (1 = first page)
Pattern | Search pattern (regex)
Fuzzy | [default=disabled] Enables fuzzy regex with a max edit distance of 2
Direction | left, right, down or up: where the value can be found
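The idea of looking for a value in a given direction from a keyword can be sketched in Python as follows. The `Token` class, the `tolerance` parameter and the "nearest token in that direction" rule are assumptions made for this illustration; the platform's actual matching logic may differ.

```python
import re
from dataclasses import dataclass

# Unit steps per direction; y grows downwards on a page.
DIRECTIONS = {"left": (-1, 0), "right": (1, 0), "up": (0, -1), "down": (0, 1)}


@dataclass
class Token:
    text: str
    x: float  # horizontal centre as a fraction of the page width
    y: float  # vertical centre as a fraction of the page height


def find_value(tokens, keyword_pattern, direction, tolerance=0.01):
    """Return the text of the nearest token in `direction` from the keyword."""
    key = next((t for t in tokens if re.fullmatch(keyword_pattern, t.text)), None)
    if key is None:
        return None
    dx, dy = DIRECTIONS[direction]
    candidates = []
    for token in tokens:
        if token is key:
            continue
        # Distance along the search direction and offset perpendicular to it.
        along = (token.x - key.x) * dx + (token.y - key.y) * dy
        across = abs((token.x - key.x) * dy) + abs((token.y - key.y) * dx)
        if along > 0 and across <= tolerance:
            candidates.append((along, token))
    if not candidates:
        return None
    return min(candidates, key=lambda item: item[0])[1].text


tokens = [
    Token("Invoice-No", 0.10, 0.20), Token("4711", 0.30, 0.20),
    Token("Date", 0.10, 0.25), Token("12.03.2024", 0.30, 0.25),
]
print(find_value(tokens, r"Invoice-No", "right"))
```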
Extractor "fixed-region"
Definition of the region from which the result is extracted. The whole text in the region is the result, so in most cases a transformer is necessary to extract the correct data. The values top, left, right and bottom define the limits of the rectangle (in percent, 0 to 100).
Parameter | Description
Page number | Restriction of the page number (1 = first page)
Pattern | Search pattern (regex)
Top | Top coordinate of the rectangle in percent (0 <= percent <= 100)
Left | Left coordinate of the rectangle in percent (0 <= percent <= 100)
Bottom | Bottom coordinate of the rectangle in percent (0 <= percent <= 100)
Right | Right coordinate of the rectangle in percent (0 <= percent <= 100)
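The following Python sketch shows why a transformer is usually needed: every word inside the rectangle becomes part of the result. The `Word` class, the use of word centres for the containment test and the function `text_in_region` are assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # horizontal centre in percent of the page width
    y: float  # vertical centre in percent of the page height


def text_in_region(words, top, left, bottom, right):
    """Join all words whose centre lies inside the rectangle (values in percent)."""
    for limit in (top, left, bottom, right):
        if not 0 <= limit <= 100:
            raise ValueError("region limits must be between 0 and 100 percent")
    if top >= bottom or left >= right:
        raise ValueError("top must be above bottom and left must be left of right")
    inside = [w for w in words if left <= w.x <= right and top <= w.y <= bottom]
    inside.sort(key=lambda w: (w.y, w.x))  # reading order: top to bottom, left to right
    return " ".join(w.text for w in inside)


words = [
    Word("Invoice", 70, 5), Word("No.", 78, 5), Word("4711", 85, 5),
    Word("Total", 70, 90),
]
# Upper right corner of the page
print(text_in_region(words, top=0, left=60, bottom=10, right=100))
```

The result contains the label as well as the number, so a transformer (for example with the pattern `\d+`) would be needed to reduce it to the invoice number alone.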
Extractor "key-region"
Definition of a keyword, a distance vector (horizontal and vertical distance) and a region size (textbox width and height) that locate the value relative to the keyword.
A regular expression can be defined as the keyword (pattern). If multiple patterns are configured and two or more of them match, the topmost pattern has the highest priority. Only one vector and one region size can be provided.
Parameter | Description
Page number | Restriction of the page number (1 = first page)
Pattern | Search pattern (regex)
Fuzzy | [default=disabled] Enables fuzzy regex with a max edit distance of 2
X | Top-left to top-left horizontal distance in fractional (0-1) page coordinates
Y | Top-left to top-left vertical distance in fractional (0-1) page coordinates
WIDTH | Width of the textbox in fractional (0-1) page coordinates
HEIGHT | Height of the textbox in fractional (0-1) page coordinates
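How X, Y, WIDTH and HEIGHT combine into a region can be sketched in Python as follows. The function `key_region`, the clamping to the page and the rounding to four decimals are assumptions made for this illustration.

```python
def key_region(key_left, key_top, x, y, width, height):
    """Compute the value region from the keyword's top-left corner.

    All arguments are fractional (0-1) page coordinates. X and Y are the
    distances from the top-left corner of the keyword to the top-left corner
    of the region; WIDTH and HEIGHT give the size of the region.
    """
    def clamp(value):
        # Keep the region on the page and avoid floating-point noise.
        return round(min(1.0, max(0.0, value)), 4)

    left = key_left + x
    top = key_top + y
    return {
        "left": clamp(left),
        "top": clamp(top),
        "right": clamp(left + width),
        "bottom": clamp(top + height),
    }


# Keyword found at (0.10, 0.20); the value sits 0.25 to the right on the same line.
print(key_region(0.10, 0.20, x=0.25, y=0.0, width=0.25, height=0.05))
```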
Extractor "key-textblock"
Definition of a keyword, a direction (left, right, down, up) in which the value is placed relative to the keyword, and neighbour/clustering parameters.
A regular expression can be defined as the keyword (pattern). If multiple patterns are configured and two or more of them match, the topmost pattern has the highest priority. A direction needs to be defined for each pattern.
Parameter | Description
Page number | Restriction of the page number (1 = first page)
Pattern | Search pattern (regex)
Fuzzy | [default=disabled] Enables fuzzy regex with a max edit distance of 2
Direction | left, right, down or up: where the value can be found
EPS | "The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function." (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html)
MIN_SAMPLES | "The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself." (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html)
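To give a feeling for EPS and MIN_SAMPLES, the following sketch clusters word positions with a small pure-Python DBSCAN. It is a simplified re-implementation for illustration only, not the scikit-learn code; how the platform derives the points from the document is not described here, so the sample points (word centres in fractional page coordinates) are an assumption.

```python
from math import dist


def dbscan(points, eps, min_samples):
    """Label each point with a cluster index, or -1 for noise."""
    labels = [None] * len(points)  # None = not visited yet

    def neighbours(i):
        # All points within `eps` of point i, including point i itself.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # not a core point; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point, keep expanding
    return labels


# Three lines of one text block and one word far away on the page
points = [(0.10, 0.20), (0.10, 0.22), (0.10, 0.24), (0.60, 0.80)]
print(dbscan(points, eps=0.03, min_samples=2))
```

With `eps=0.03` the three stacked lines chain into one cluster, although the first and the third are 0.04 apart. This is the sense in which EPS is not a bound on the distances within a cluster. A smaller EPS splits the block; a larger MIN_SAMPLES turns sparse groups into noise.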
Extractor "esr"
Extraction of the ESR inpayment slip (BVR/PVR/ISR).
Parameter | Description
Page number | Restriction of the page number (1 = first page)

title | extractor attribute
ESR reference (3) | reference
ESR account (4) | account
ESR amount (2) | amount
ESR currency (1) | currency
ESR code (5) | code
Extractor "qr-bill"
Extraction of the QR code with payment information.
Parameter | Description
Page number | Restriction of the page number (1 = first page)

title | extractor attribute
Account | account
Amount | amount
Currency | currency
Company name (payable to) | cr_name
Street (payable to) | cr_street
House number (payable to) | cr_house_number
Postal code (payable to) | cr_postal_code
City (payable to) | cr_city
Country (payable to) | cr_country
Company name (payable by) | ud_name
Street (payable by) | ud_street
House number (payable by) | ud_house_number
Postal code (payable by) | ud_postal_code
City (payable by) | ud_city
Country (payable by) | ud_country
Reference type | ref_type
Reference | ref
Message | message
Billing Information | billing_info
Extractor "tax-ids"
Extraction of a sender and receiver tax id.
Parameter | Description
Page number | Restriction of the page number (1 = first page)

For this extractor, too, the "extractor attribute" needs to be defined.
title | extractor attribute
Tax id (sender) | sender_tax_id
Tax id (receiver) | receiver_tax_id
Extractor "banking-details"
Extraction of banking accounts.
Parameter | Description
Page number | Restriction of the page number (1 = first page)

For this extractor, too, the "extractor attribute" needs to be defined.
name | extractor attribute
Bank name | bank_name
IBAN | iban
BIC | bic
Street | street
House number | house_number
Postal code | postal_code
City | city_name
Extractor "checkbox"
To extract checkboxes, a keyword that identifies the group of checkboxes is needed first. The vectors to the individual checkboxes are stored relative to this keyword.
Parameter | Description
Page number | Restriction of the page number (1 = first page)
Label | Label of the i-th element
Identifier | Internal identifier of the i-th element
At the moment, the checkbox extractor cannot be configured conveniently via the GUI, only via input parameters (in JSON format).
{
  "train_data": {
    "scale": false,
    "vectors": [
      [0.1755, 0.0061], [0.1755, 0.0418], [0.1755, 0.0775], [0.1755, 0.1132]
    ],
    "key_text": "ProductA",
    "key_descr": "ProductADesc",
    "box_descrs": [
      "ProductA", "ProductB", "ProductC", "ProductD"
    ],
    "key_center": [
      0.1539, 0.1748, 0
    ],
    "box_regions": [
      { "top": 0.1665, "left": 0.298, "right": 0.3214, "bottom": 0.1822 },
      { "top": 0.2125, "left": 0.298, "right": 0.3214, "bottom": 0.2282 },
      { "top": 0.2585, "left": 0.298, "right": 0.3214, "bottom": 0.2742 },
      { "top": 0.3045, "left": 0.298, "right": 0.3214, "bottom": 0.3202 }
    ],
    "vector_mode": "bottom-left"
  }
}
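Because the parameters are written by hand, a quick consistency check before saving them can help. The following Python sketch parses a shortened version of the example above (two of the four checkboxes) and pairs each description with its region. The function `check_train_data` is hypothetical, and the rule that `vectors`, `box_descrs` and `box_regions` must have the same length is an assumption derived from the example, not a documented requirement.

```python
import json

# Shortened copy of the example above: the first two checkboxes only
CONFIG_TEXT = """
{
  "train_data": {
    "scale": false,
    "vectors": [[0.1755, 0.0061], [0.1755, 0.0418]],
    "key_text": "ProductA",
    "key_descr": "ProductADesc",
    "box_descrs": ["ProductA", "ProductB"],
    "key_center": [0.1539, 0.1748, 0],
    "box_regions": [
      {"top": 0.1665, "left": 0.298, "right": 0.3214, "bottom": 0.1822},
      {"top": 0.2125, "left": 0.298, "right": 0.3214, "bottom": 0.2282}
    ],
    "vector_mode": "bottom-left"
  }
}
"""


def check_train_data(train_data):
    """Return {description: region} after basic consistency checks."""
    descriptions = train_data["box_descrs"]
    regions = train_data["box_regions"]
    # Assumption: one vector, one description and one region per checkbox.
    if not len(train_data["vectors"]) == len(descriptions) == len(regions):
        raise ValueError("vectors, box_descrs and box_regions differ in length")
    for region in regions:
        if not (0 <= region["top"] < region["bottom"] <= 1
                and 0 <= region["left"] < region["right"] <= 1):
            raise ValueError(f"invalid region: {region}")
    return dict(zip(descriptions, regions))


boxes = check_train_data(json.loads(CONFIG_TEXT)["train_data"])
for description, region in boxes.items():
    print(description, region)
```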
ML-Extractor "address-detector" -> "object-detector-libpostal-multistep"
This is a machine learning-based extractor that can be used without parameters. It predicts the classes of objects (rectangular regions), passes the regions through an address parser and returns the candidates constructed from them, where the predictions are the address objects/dictionaries within the predicted regions.
The address-detector should only be used in field types, because the region of the address is recognized as a whole and the individual fields are then extracted from the whole address using various rules. The extractor attribute of each of these fields has to be defined according to the table below.
Parameter | Description
Page number | Restriction of the page number (1 = first page)
Threshold | 0.02
In order to use an address field set, the mandatory fields must be present, as they span the rectangle of the address. The company name or the contact name are usually in the first line of an address, the city and the postal code are in the last line of an address.
name | extractor attribute | mandatory
Company name | company_name | yes
Title | salutation |
Contact name | name_2 | yes
Street | street |
House number | house_number |
Address suffix | street_2 |
Postal code | postal_code | yes
City | city_name | yes
Country | country_code |
ML-Extractor "repeatable-fieldset-detector"
This is a machine learning-based extractor that can be used without parameters. It predicts the classes of objects (rectangular regions) using multiclass object classification and returns the candidate fieldset objects/dictionaries constructed from them.
This extractor is used when multiple fields depend on each other, e.g. to recognize line item data. In this case a structure is trained in the background, for example: the quantity is a float value, located to the left of the total amount, with the keyword "Quantity shipped" as its column heading.
Parameter | Description
Page number | Restriction of the page number (1 = first page)
Threshold | 0.02
ML-Extractor "object-detector"
This is a machine learning-based extractor that can be used without parameters. It predicts the classes of objects (rectangular regions) and returns the candidates constructed from them.
Parameter | Description
Page number | Restriction of the page number (1 = first page)
Threshold | 0.02
ML-Extractor "object-detector-fixed-region"
This is a machine learning-based extractor that can be used without parameters. It predicts the classes of objects (rectangular regions), passes the regions through a region-based text filter and returns the candidates constructed from them, where the predictions are the texts within the predicted regions.
Parameter | Description
Page number | Restriction of the page number (1 = first page)
Threshold | 0.02
ML-Extractor "ml-key-value-pair"
This is a machine learning-based extractor that can be used without parameters. It predicts values based on trained keywords.
An example is the extraction of the invoice number. Depending on the invoicing party and the ERP system used, many keywords and key-value relations occur, e.g. below the keyword "Invoice-No", to the right of "Rechnungsnummer", and so on.
The extractor learns the dictionary of possible keywords and relationships.
Parameter | Description
Page number | Restriction of the page number (1 = first page)
Threshold | 0.02
ML-Extractor "document-language"
This is a machine learning-based language detector, which uses a library to detect the language from the OCR text of a document.
Parameter | Description
Page number | Restriction of the page number (1 = first page)