Learn how to use Extractors

Extractors are the main building block of the extraction pipeline and do most of the heavy lifting. They let the user define how a certain type of information is extracted, i.e. through machine learning (which requires examples to train a model on) or through rule-based approaches (which need no examples and work right away). Depending on the use case, one approach or the other is better suited.

Template/Rule Based Extractors for fields

Name of extractor	Description
regex	Searches for a string that matches the defined regular expression. If more than one string matches, the system takes the first one.
barcodes	This extractor can be used to filter input barcodes with regex patterns, types and page number.
fixed_region	Extracts a value from a fixed region (depending only on the coordinates). In most cases a verifier or a transformer is needed.
key-value-pair	Defines a keyword and a direction (left, right, down, up) in which the value is located relative to the keyword.
currency	Find the most common currency on the document
extract-nothing	It is a placeholder for an extractor, because a field needs an extractor. Nothing is read in the process. The field exists exclusively for pure manual validation.
key-textblock	Extraction of a continuous text block relative to a key token

Machine Learning (ML) Based Extractors for field sets

Name of extractor	Description
gnn-address-extractor	Extraction of an address-block
repeatable-fieldset-detector	Extraction of an n-tuple of repeatable values, like line items, banking details or payment conditions
gnn-fieldset-extractor	Extraction of non-repeatable fieldsets

Machine Learning (ML) Based Extractors for fields

Name of extractor	Description
Gnn field extractor	Extraction of single values and regions, with keyword or without keyword
gnn-address-extractor	Extraction of single field addresses

Fuzzy option

If the fuzzy option is enabled, values that are similar to the search pattern are also accepted. The tolerance depends on the length of the word and the similarity of the characters.

Search-Pattern	Value	Matches
[A-Z]{3}	ABC	yes
[A-Z]{3}	AB5	no
[A-Z]{5}	ABCDE	yes
[A-Z]{5}	ABCD5	yes
[A-Z]{5}	AB6D5	no
[A-Z]{10}	ABCDEFGHIJ	yes
[A-Z]{10}	ABC65FGHIJ	yes
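
The rows above can be reproduced with a small sketch. The tolerance rule used here (one tolerated character per four pattern characters, capped at two) is an assumption chosen only because it fits the table; the platform's actual rule, which also weighs the similarity of characters, is not described in this document. The function names are hypothetical.

```python
import re

def tolerance(length: int) -> int:
    # Assumed rule, fitted to the table above: 3 -> 0, 5 -> 1, 10 -> 2.
    return min(2, length // 4)

def fuzzy_class_match(char_class: str, length: int, value: str) -> bool:
    """Fuzzy check for patterns of the form <char_class>{length}."""
    if len(value) != length:
        return False
    # Count the characters that fall outside the character class.
    mismatches = sum(1 for ch in value if re.fullmatch(char_class, ch) is None)
    return mismatches <= tolerance(length)

rows = [
    ("[A-Z]", 3, "ABC"),
    ("[A-Z]", 3, "AB5"),
    ("[A-Z]", 5, "ABCD5"),
    ("[A-Z]", 5, "AB6D5"),
    ("[A-Z]", 10, "ABC65FGHIJ"),
]
results = [fuzzy_class_match(c, n, v) for c, n, v in rows]
```

This sketch only handles fixed-length character-class patterns; general fuzzy regex matching needs a dedicated matching engine.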

Output data types

The output data type describes how the result of the extractor is interpreted and how the field is displayed in the validation mask of the Parashift Platform.

Output data type	Description
Allowed values	Definition of a list of allowed values, e.g. for the gender of a person (male / female / non-binary). If the extracted value is not in the list of allowed values, it is not recognized and not inserted.
Boolean	A checkbox that signals whether a value was found.
Date	A date, which can be restricted by defining a minimum or maximum date.
Datetime	Date including hour and minute.
Float	Decimal value with up to two decimals
Integer	Integer value, without decimals
String	Text
Multi-Checkbox	Checkbox group / see "extractor 'checkbox'"
Page coordinates	Stores the coordinates of a value instead of the value itself.

Output data type - "Allowed values"

Each allowed value consists of an internal value and a display value. The internal value is stored in the background and is also what API calls to the platform return. The display value is only used in the front end.

Note that the value is case sensitive: if "Male" with an uppercase M is recognized but the allowed value is "male", the value is not set. A transformer is mandatory in this case.
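
The case-sensitivity issue can be illustrated with a minimal sketch. The value list, the function name `resolve_allowed` and the use of `str.lower` as the transformer are assumptions made for this example; they are not platform APIs.

```python
# Hypothetical allowed-value list: internal value -> display value.
ALLOWED_VALUES = {"male": "Male", "female": "Female", "non-binary": "Non-binary"}

def resolve_allowed(extracted, transformer=None):
    """Return the internal value if it is in the list, otherwise None."""
    value = transformer(extracted) if transformer else extracted
    # The comparison is case sensitive, as described above.
    return value if value in ALLOWED_VALUES else None

# "Male" does not equal the internal value "male", so nothing is set.
without_transformer = resolve_allowed("Male")

# A lower-casing transformer normalizes the value before the lookup.
with_transformer = resolve_allowed("Male", transformer=str.lower)
```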

"Allowed values" in the configuration of a field:

"Allowed values" in the validation of a field:

Output data type "Boolean"

The output type Boolean shows a checkbox. The checkbox is selected if something was recognized for the field. This is independent of the value itself: "true" or "yes" selects the checkbox, but so does "no" or "false".
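
As a minimal sketch of this behaviour (the function name is hypothetical, and treating whitespace-only text as "nothing recognized" is an assumption):

```python
from typing import Optional

def checkbox_checked(extracted: Optional[str]) -> bool:
    # Checked as soon as any non-empty value was recognized.
    # The content of the value is deliberately ignored.
    return extracted is not None and extracted.strip() != ""
```

With this logic, `checkbox_checked("no")` is just as true as `checkbox_checked("yes")`; only a missing value leaves the checkbox empty.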

Output data type "Date"

The date type can be restricted, if a minimum or maximum date is defined.

"Date" in the configuration of a field:

"Date" in the validation of a field

Defining a minimum or maximum date has no effect on the extraction of the value, but if the extracted or validated value is out of range, an error occurs.
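
The separation between extraction and range checking might look like the following sketch. The function name and the error messages are hypothetical, and treating the limits as inclusive is an assumption.

```python
from datetime import date
from typing import List, Optional

def validate_date(value: date,
                  minimum: Optional[date] = None,
                  maximum: Optional[date] = None) -> List[str]:
    """Check an already extracted date; the extraction itself is untouched."""
    errors = []
    if minimum is not None and value < minimum:
        errors.append(f"{value.isoformat()} is before the minimum {minimum.isoformat()}")
    if maximum is not None and value > maximum:
        errors.append(f"{value.isoformat()} is after the maximum {maximum.isoformat()}")
    return errors

in_range = validate_date(date(2021, 6, 1), minimum=date(2020, 1, 1))
too_early = validate_date(date(2019, 12, 31), minimum=date(2020, 1, 1))
```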

Output data type "Datetime"

Like the data type Date, the data type Datetime can be restricted with a minimum and maximum date.

In the validation mask, an hour and a minute component are available.

Output data type "Float"

The float type is a number with two decimals. The value can also be restricted with a minimum and/or maximum value.

"Float" in the configuration of a field:

"Float" in the validation of a field:

Defining a minimum or maximum value has no effect on the extraction of the value, but if the extracted or validated value is out of range, an error occurs.

Output data type "Page coordinates"

Only the page coordinates are extracted. The value is shown as an image.

This means that only the coordinates of the value are available and stored, not its content.

Extractor "extract-nothing"

It is a placeholder for an extractor, because a field needs an extractor. Nothing is read in the process. The field exists exclusively for pure manual validation.

Extractor "barcodes"

This extractor can be used to filter input barcodes by regex pattern, type and page number. Supported barcode types: Code128, Code39, datamatrix, EAN-13, EAN-8, QRCODE.

Parameter	Description
Page number	Restricts the extraction to a page number (1 = first page)
Pattern	Search pattern (regex)
Type	Barcode type, Code128, Code39, datamatrix, EAN-13, EAN-8, QRCODE
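
How the three parameters narrow down a list of detected barcodes can be sketched as follows. The `Barcode` class, its field names and `filter_barcodes` are hypothetical and only illustrate the filtering logic; they do not reflect the platform's internal data model.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Barcode:
    page: int      # 1 = first page
    kind: str      # e.g. "QRCODE", "EAN-13", "Code128"
    content: str   # decoded barcode text

def filter_barcodes(barcodes: List[Barcode],
                    page: Optional[int] = None,
                    pattern: Optional[str] = None,
                    kind: Optional[str] = None) -> List[Barcode]:
    """Keep only barcodes that satisfy every filter that is set."""
    result = []
    for barcode in barcodes:
        if page is not None and barcode.page != page:
            continue
        if kind is not None and barcode.kind != kind:
            continue
        if pattern is not None and re.search(pattern, barcode.content) is None:
            continue
        result.append(barcode)
    return result

found = [
    Barcode(1, "EAN-13", "4006381333931"),
    Barcode(1, "QRCODE", "ORDER-2024-17"),
    Barcode(2, "QRCODE", "ORDER-2024-18"),
]
first_page_orders = filter_barcodes(found, page=1, pattern=r"^ORDER-\d{4}-\d+$", kind="QRCODE")
```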

Extractor "regex"

The regex extractor is used to extract a value with a clear structure, e.g. an identification number or a document date.

The regex extractor searches for the matching fragment, and the whole content of that fragment is used as the result of the extraction process.

Because of this, it may be necessary to add a transformer with the same regex pattern to isolate the value within the fragment.

It is also possible to define more than one regex pattern. If multiple results are found, the topmost regex pattern has the highest priority.

Parameter	Description
Page number	Restricts the extraction to a page number (1 = first page)
Pattern	Search pattern (regex)
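
The interplay of pattern priority, whole-fragment results and a follow-up transformer can be sketched like this. It assumes that a "fragment" is a piece of OCR text such as a token; the function name and the sample data are hypothetical.

```python
import re
from typing import List, Optional

def regex_extract(fragments: List[str], patterns: List[str]) -> Optional[str]:
    """Return the whole first fragment matched by the highest-priority pattern."""
    for pattern in patterns:          # topmost pattern first
        for fragment in fragments:    # fragments in reading order
            if re.search(pattern, fragment):
                return fragment       # the whole fragment is the result
    return None

fragments = ["Invoice", "No.:RE-2024-0042", "Date", "12.03.2024"]
pattern = r"RE-\d{4}-\d{4}"

raw = regex_extract(fragments, [pattern])

# A transformer with the same pattern isolates the value inside the fragment.
clean = re.search(pattern, raw).group(0) if raw else None
```

Here `raw` still carries the prefix `No.:`, which is why the second step is needed.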

Extractor "key-value-pair"

Definition of a keyword and a direction (left, right, down, up) in which the value is located relative to the keyword.

A regular expression can be defined as the keyword (pattern). If multiple patterns are configured and two or more of them match, the topmost expression has the highest priority. A direction needs to be defined for each pattern.

Parameter	Description
Page number	Restricts the extraction to a page number (1 = first page)
Pattern	Search pattern (regex)
Fuzzy	[default=disabled] Enables fuzzy regex with a max edit distance of 2
Direction	Direction (left, right, down, up) in which the value can be found
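
A rough sketch of the keyword-plus-direction idea follows. It assumes tokens with centre coordinates in fractional page units (y growing downwards) and picks the nearest token in the configured direction; the `Token` class, the `max_offset` tolerance and the selection rule are assumptions for illustration, not the platform's algorithm.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Token:
    text: str
    x: float  # centre, fractional page coordinates
    y: float  # centre, fractional page coordinates (0 = top)

DIRECTIONS = {"left": (-1, 0), "right": (1, 0), "up": (0, -1), "down": (0, 1)}

def key_value(tokens: List[Token],
              rules: List[Tuple[str, str]],
              max_offset: float = 0.02) -> Optional[str]:
    """rules is a prioritized list of (keyword pattern, direction) pairs."""
    for pattern, direction in rules:              # topmost rule first
        dx, dy = DIRECTIONS[direction]
        for key in tokens:
            if not re.search(pattern, key.text):
                continue
            best = None
            for token in tokens:
                if token is key:
                    continue
                # Distance along the direction and sideways drift from it.
                along = (token.x - key.x) * dx + (token.y - key.y) * dy
                across = abs((token.x - key.x) * dy - (token.y - key.y) * dx)
                if along > 0 and across <= max_offset:
                    if best is None or along < best[0]:
                        best = (along, token)
            if best is not None:
                return best[1].text
    return None

tokens = [
    Token("Invoice-No", 0.10, 0.10), Token("4711", 0.30, 0.10),
    Token("Date", 0.10, 0.20), Token("12.03.2024", 0.10, 0.24),
]
invoice_no = key_value(tokens, [("Invoice-No", "right")])
invoice_date = key_value(tokens, [("Date", "down")])
```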

Extractor "fixed-region"

Definition of the region from which the result is extracted. The whole text in the region is the result, so in most cases a transformer is necessary to extract the correct data. The values top, left, right and bottom define the limits of the rectangle (in percent, 0 to 100).

Parameter	Description
Page number	Restricts the extraction to a page number (1 = first page)
Pattern	Search pattern (regex)
Top	top coordinate of the rectangle in percent (0 <= percent <= 100)
Left	left coordinate of the rectangle in percent (0 <= percent <= 100)
Bottom	bottom coordinate of the rectangle in percent (0 <= percent <= 100)
Right	right coordinate of the rectangle in percent (0 <= percent <= 100)
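
The effect of the four limits can be sketched as follows. The `Word` class and the choice to test word centres against the rectangle are assumptions; the sketch also assumes that top and bottom are measured from the top edge of the page.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Word:
    text: str
    x: float  # centre in percent of the page width
    y: float  # centre in percent of the page height (0 = top)

def fixed_region_text(words: List[Word], top: float, left: float,
                      bottom: float, right: float) -> str:
    """Join all words whose centre lies inside the rectangle, in reading order."""
    inside = [w for w in words if left <= w.x <= right and top <= w.y <= bottom]
    inside.sort(key=lambda w: (w.y, w.x))
    return " ".join(w.text for w in inside)

words = [Word("Invoice", 10, 5), Word("4711", 20, 5), Word("Total", 10, 90)]
header_text = fixed_region_text(words, top=0, left=0, bottom=10, right=50)
```

The result contains everything in the region ("Invoice 4711" here), which is why a transformer is usually needed to cut out the actual value.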

Extractor "key-region"

Definition of a keyword, a distance vector (horizontal and vertical distance) and a region size (textbox width and height) that together determine where the value is located relative to the keyword.

Parameter	Description
Page number	Restricts the extraction to a page number (1 = first page)
Pattern	Search pattern (regex)
Fuzzy	[default=disabled] Enables fuzzy regex with a max edit distance of 2
X	Top-left to top-left horizontal distance in fractional (0-1) page coordinates
Y	Top-left to top-left vertical distance in fractional (0-1) page coordinates
WIDTH	Width of the textbox in fractional (0-1) page coordinates
HEIGHT	Height of the textbox in fractional (0-1) page coordinates
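
How X, Y, WIDTH and HEIGHT combine into a textbox can be sketched with a few additions. The function name and the returned dictionary layout are hypothetical; the arithmetic follows the "top-left to top-left" description in the table.

```python
def key_region(key_left: float, key_top: float,
               x: float, y: float, width: float, height: float) -> dict:
    """Textbox whose top-left corner is offset by (x, y) from the keyword's top-left corner."""
    left = key_left + x
    top = key_top + y
    return {"left": left, "top": top, "right": left + width, "bottom": top + height}

# Keyword at (0.25, 0.125); the value box starts 0.25 to the right on the same line.
box = key_region(key_left=0.25, key_top=0.125, x=0.25, y=0.0, width=0.25, height=0.125)
```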

Extractor "key-textblock"

Definition of a keyword, a direction (left, right, down, up) in which the value is located relative to the keyword, and neighbour/clustering parameters.

Parameter	Description
Page number	Restricts the extraction to a page number (1 = first page)
Pattern	Search pattern (regex)
Fuzzy	[default=disabled] Enables fuzzy regex with a max edit distance of 2
Direction	Direction (left, right, down, up) in which the value can be found
EPS	"The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function." (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html)
MIN_SAMPLES	"The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself." (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html)
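
To give a feeling for what EPS and MIN_SAMPLES do, here is a small dependency-free sketch of the DBSCAN idea applied to hypothetical token centres. It is not scikit-learn's implementation and not necessarily how the platform groups tokens into text blocks; it only shows how the two parameters decide which points form a cluster and which are noise.

```python
from math import dist
from typing import List, Sequence

def dbscan(points: Sequence[Sequence[float]], eps: float, min_samples: int) -> List[int]:
    """Return one cluster label per point; -1 marks noise."""
    labels: List = [None] * len(points)
    cluster = -1

    def neighbours(i: int) -> List[int]:
        # Includes the point itself, as in the MIN_SAMPLES description above.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = -1            # noise for now; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)   # j is a core point, keep expanding
    return labels

# Three token centres close together and one far away (fractional page coordinates).
centres = [(0.10, 0.10), (0.12, 0.10), (0.10, 0.12), (0.80, 0.80)]
labels = dbscan(centres, eps=0.05, min_samples=2)
```

A larger EPS merges tokens that are further apart into the same block; a larger MIN_SAMPLES requires denser groups before a block is formed.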

Extractor "esr"

Extraction of the ESR inpayment slip (BVR/PVR/ISR).

Parameter	Description
Page number	Restricts the extraction to a page number (1 = first page)

title	extractor attribute
ESR reference (3)	reference
ESR account (4)	account
ESR amount (2)	amount
ESR currency (1)	currency
ESR code (5)	code

Extractor "qr-bill"

Extraction of the QR code with payment information.

Parameter	Description
Page number	Restricts the extraction to a page number (1 = first page)

title	extractor attribute
Account	account
Amount	amount
Currency	currency
Company name (payable to)	cr_name
Street (payable to)	cr_street
House number (payable to)	cr_house_number
Postal code (payable to)	cr_postal_code
City (payable to)	cr_city
Country (payable to)	cr_country
Company name (payable by)	ud_name
Street (payable by)	ud_street
House number (payable by)	ud_house_number
Postal code (payable by)	ud_postal_code
City (payable by)	ud_city
Country (payable by)	ud_country
Reference type	ref_type
Reference	ref
Message	message
Billing Information	billing_info

Extractor "tax-ids"

Extraction of a sender and a receiver tax ID.

Parameter	Description
Page number	Restricts the extraction to a page number (1 = first page)

The "extractor attribute" needs to be defined for this extractor as well.

title	extractor attribute
Tax id (sender)	sender_tax_id
Tax id (receiver)	receiver_tax_id

Extractor "banking-details"

Extraction of banking accounts.

Parameter	Description
Page number	Restricts the extraction to a page number (1 = first page)

The "extractor attribute" needs to be defined for this extractor as well.

name	extractor attribute
Bank name	bank_name
IBAN	iban
BIC	bic
Street	street
House number	house_number
Postal code	postal_code
City	city_name

Extractor "checkbox"

To extract checkboxes, a keyword is first needed to locate the group of checkboxes. The vectors to the checkboxes are stored relative to this keyword.

Parameter	Description
Page number	Restricts the extraction to a page number (1 = first page)
Label	Label of the i-th element
Identifier	internal identifier of the i-th element

At the moment, the checkbox extractor cannot be configured conveniently via the GUI, only through input parameters (in JSON format).

{
	"train_data": {
		"scale": false,
		"vectors": [
			[0.1755, 0.0061], [0.1755,0.0418], [0.1755, 0.0775], [0.1755, 0.1132]
		],
		"key_text": "ProductA",
		"key_descr": "ProductADesc",
		"box_descrs": [
			"ProductA", "ProductB", "ProductC", "ProductD"
		],
		"key_center": [
			0.1539, 0.1748, 0
		],
		"box_regions": [
			{  "top": 0.1665, "left": 0.298, "right": 0.3214, "bottom": 0.1822},
			{  "top": 0.2125, "left": 0.298, "right": 0.3214, "bottom": 0.2282},
			{  "top": 0.2585, "left": 0.298, "right": 0.3214, "bottom": 0.2742},
			{  "top": 0.3045, "left": 0.298, "right": 0.3214, "bottom": 0.3202}
		],
		"vector_mode": "bottom-left"
	}
}
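
Because this JSON has to be written by hand, a quick structural check before submitting it can help. The sketch below is hypothetical: the rules it checks (one vector, one description and one region per checkbox; regions given as fractional rectangles) are inferred from the sample above and are not a documented schema.

```python
import json
from typing import List

def check_train_data(config: dict) -> List[str]:
    """Return a list of structural problems found in a checkbox configuration."""
    data = config["train_data"]
    problems = []
    count = len(data["box_descrs"])
    # Assumption: every checkbox has exactly one vector and one region.
    if len(data["vectors"]) != count or len(data["box_regions"]) != count:
        problems.append("vectors, box_descrs and box_regions differ in length")
    for i, region in enumerate(data["box_regions"]):
        horizontal_ok = 0 <= region["left"] < region["right"] <= 1
        vertical_ok = 0 <= region["top"] < region["bottom"] <= 1
        if not (horizontal_ok and vertical_ok):
            problems.append(f"box_regions[{i}] is not a valid rectangle")
    return problems

# Shortened version of the sample above (two of the four checkboxes).
sample = json.loads("""
{"train_data": {
  "vectors": [[0.1755, 0.0061], [0.1755, 0.0418]],
  "box_descrs": ["ProductA", "ProductB"],
  "box_regions": [
    {"top": 0.1665, "left": 0.298, "right": 0.3214, "bottom": 0.1822},
    {"top": 0.2125, "left": 0.298, "right": 0.3214, "bottom": 0.2282}
  ]
}}
""")
problems = check_train_data(sample)
```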

ML-Extractor "address-detector" -> "object-detector-libpostal-multistep"

This is a machine learning-based extractor and can be used without configuring parameters. It predicts the classes of objects (rectangular regions), passes the regions through an address-parser and returns the candidates constructed from them, where the predictions are the address objects/dictionaries within the predicted regions.

The address-detector should only be used in field types, because the region of the address is recognized at once and the individual fields are then extracted from the whole address using various rules. The extractor attribute of these fields has to be defined according to the table below.

Parameter	Description
Page number	Restricts the extraction to a page number (1 = first page)
Threshold	0.02

In order to use an address field set, the mandatory fields must be present, as they span the rectangle of the address. The company name or the contact name are usually in the first line of an address, the city and the postal code are in the last line of an address.

name	extractor attribute	mandatory
Company name	company_name	yes
Title	salutation
Contact name	name_2	yes
Street	street
House number	house_number
Address suffix	street_2
Postal code	postal_code	yes
City	city_name	yes
Country	country_code

ML-Extractor "repeatable-fieldset-detector"

This is a machine learning-based extractor and can be used without configuring parameters. It predicts the classes of objects (rectangular regions) using multiclass object classification and returns the candidate fieldset objects/dictionaries constructed from them.

This extractor is used if multiple fields depend on each other, e.g. to recognize line item data. In this case a structure is trained in the background, something like: the quantity is a float value, located to the left of the total amount, with the keyword "Quantity shipped" as its column heading.

Parameter	Description
Page number	Restricts the extraction to a page number (1 = first page)
Threshold	0.02

ML-Extractor "object-detector"

This is a machine learning-based extractor and can be used without configuring parameters. It predicts the classes of objects (rectangular regions) and returns the candidates constructed from them.

Parameter	Description
Page number	Restricts the extraction to a page number (1 = first page)
Threshold	0.02

ML-Extractor "object-detector-fixed-region"

This is a machine learning-based extractor and can be used without configuring parameters. It predicts the classes of objects (rectangular regions), passes the regions through a region-based text-filter and returns the candidates constructed from them, where the predictions are the texts within the predicted regions.

Parameter	Description
Page number	Restricts the extraction to a page number (1 = first page)
Threshold	0.02

ML-Extractor "ml-key-value-pair"

This is a machine learning-based extractor and can be used without configuring parameters. It predicts values according to trained keywords.

An example is the extraction of the invoice number. Depending on the invoicing party and the ERP system used, many keywords and key-value relations occur, e.g. below the keyword "Invoice-No", to the right of "Rechnungsnummer", ....

The extractor learns the dictionary of possible keywords and relationships.

Parameter	Description
Page number	Restricts the extraction to a page number (1 = first page)
Threshold	0.02

ML-Extractor "document-language"

This is a machine learning-based language detector, which uses a library to detect the language from the OCR text of a document.

Parameter	Description
Page number	Restricts the extraction to a page number (1 = first page)