Learn how to use Transformers

How to transform the data extracted by the extractors

 

Transformers post-process the preliminary results produced by the extractors. They let the user apply basic transformations to the extracted values. For example, the "Money Amount" transformer converts all of the following strings to the same decimal number, `17500.95`, by removing thousands separators, currency names, and similar additions: `17'500.95`, `17,500.95`, `17 500.95`, `17500.95`, `USD 17500.95`, `17500.95 Euro`.
If several transformers are configured, they are applied in sequence. Transformers run before the verifiers.
Transformers marked with an asterisk (*) will follow in the next release.
| Transformer | Description |
|---|---|
| String: Strip String | Removes whitespace. |
| String: Search Regex | Runs a regex search on the candidate to extract the value. |
| String: Substitute | Searches for a value using a regex and replaces it with a fixed value. |
| String: Only Characters* | Concatenates only the characters in the prediction and strips off the digits. Hint: use the transformer "String: Substitute" instead, with `[0-9]` replaced by nothing. |
| String: Date | Parses the prediction string of the candidate and returns valid dates. |
| String: Datetime | Parses the candidate value for valid datetimes and returns them in the format yyyy-mm-ddThh:mm:ss. |
| Number: percent to decimal* | Converts a percentage value to a decimal value, e.g. 72% -> 0.72. |
| Number: Only Numbers | Removes everything that is not a number. |
| Number: Only Integers | Removes everything that is not an integer value. |
| Number: Only Floats | Removes everything that is not a decimal value. |
| Number: Money Amount | Removes common currency abbreviations and symbols, and handles thousands separators. |
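The sequential processing described above can be pictured as a simple chain in which each transformer receives the output of the previous one. The following is only a minimal sketch: the function names `strip_whitespace`, `drop_digits`, and `apply_transformers` are hypothetical stand-ins and not part of the product.

```python
import re
from typing import Callable, Iterable

Transformer = Callable[[str], str]

def strip_whitespace(value: str) -> str:
    """Hypothetical stand-in for "String: Strip String"."""
    return "".join(value.split())

def drop_digits(value: str) -> str:
    """Hypothetical stand-in for "String: Substitute" with [0-9] -> nothing."""
    return re.sub(r"[0-9]", "", value)

def apply_transformers(value: str, transformers: Iterable[Transformer]) -> str:
    """Feed the output of each transformer into the next one, in order."""
    for transform in transformers:
        value = transform(value)
    return value

print(apply_transformers("AB 12 CD 34", [strip_whitespace, drop_digits]))  # ABCD
```

Because each step works on the result of the previous step, the order in which transformers are configured can change the final value.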

 

Transformer "String: Strip String"

Removes all whitespace, including whitespace inside the value.
**Example:** `CH03 4545 4545 4545 4545 1` `->` `CH0345454545454545451`
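A rough Python equivalent of this behavior, assuming that every kind of whitespace (spaces, tabs, line breaks) is removed; `strip_string` is a hypothetical name used only for illustration:

```python
def strip_string(value: str) -> str:
    # str.split() without arguments splits on any run of whitespace,
    # so joining the pieces drops all whitespace from the value.
    return "".join(value.split())

print(strip_string("CH03 4545 4545 4545 4545 1"))  # CH0345454545454545451
```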

 

Transformer "String: Search Regex"

Searches the text of the candidate for a regex pattern. If the pattern matches, the first matched substring is returned; if it does not match, the full text is kept unchanged.
| Parameter | Description |
|---|---|
| Pattern | Search string (regex) |
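The match-or-keep behavior can be sketched with Python's `re` module. The function name `search_regex` and the sample inputs are made up for illustration, and the regex dialect used by the product may differ from Python's:

```python
import re

def search_regex(candidate: str, pattern: str) -> str:
    # Return the first match if there is one, otherwise keep the full text.
    match = re.search(pattern, candidate)
    return match.group(0) if match else candidate

print(search_regex("Invoice no. 4711 from 2021", r"\d+"))  # 4711
print(search_regex("no digits here", r"\d+"))              # no digits here
```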

 

Transformer "String: Substitute"

Searches for a value using a regex and replaces it with a fixed value. The rest of the value is left untouched, e.g. DIESELMOTOR -> DieselMOTOR.
In principle, regex backreferences can also be used to reorder parts of the match if the search expression contains parentheses. `\1` refers to the content of the first pair of parentheses. For example, the search string `(ABC)(DEF)(GHI)` with the replacement `\3\1` matches ABCDEFGHI and returns GHIABC.
| Parameter | Description |
|---|---|
| Search for (string) | Search string (regex) |
| Replace with (string) | Replacement value |
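The examples from this section can be reproduced with Python's `re.sub`, which is used here only as an illustration. `substitute` is a hypothetical helper, and whether the product replaces every occurrence (as `re.sub` does) or only the first is an assumption:

```python
import re

def substitute(value: str, search_for: str, replace_with: str) -> str:
    # re.sub replaces every match; text outside the matches is left untouched.
    return re.sub(search_for, replace_with, value)

print(substitute("DIESELMOTOR", "DIESEL", "Diesel"))         # DieselMOTOR
print(substitute("ABCDEFGHI", r"(ABC)(DEF)(GHI)", r"\3\1"))  # GHIABC
print(substitute("AB12CD", "[0-9]", ""))                     # ABCD
```

The last call mirrors the hint for "String: Only Characters": replacing `[0-9]` with nothing strips the digits.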

 

Transformer "String: Date"

Parses the prediction string of the candidate and returns valid dates. A date always consists of three parts. The parameter "DATE_ORDER" defines the order in which these parts are interpreted.
10.11.12 using the date order YMD -> 2010-11-12
10.11.12 using the date order DMY -> 2012-11-10
| Parameter | Description |
|---|---|
| date order | Order of the date parts: DMY, YMD, ... |
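The effect of the date order can be sketched as follows. This is not the product's parser: `parse_date` is a hypothetical function, and the rule that two-digit years are mapped to 20xx is an assumption inferred from the two examples above.

```python
import re
from datetime import date

def parse_date(value: str, date_order: str = "DMY") -> str:
    parts = re.findall(r"\d+", value)  # the three numeric parts of the date
    if len(parts) != 3 or sorted(date_order) != ["D", "M", "Y"]:
        raise ValueError("expected three date parts and an order such as DMY")
    # Pair each letter of the order with the part in the same position.
    fields = dict(zip(date_order, map(int, parts)))
    year = fields["Y"]
    if year < 100:
        year += 2000  # assumption: two-digit years belong to the 2000s
    return date(year, fields["M"], fields["D"]).isoformat()

print(parse_date("10.11.12", "YMD"))  # 2010-11-12
print(parse_date("10.11.12", "DMY"))  # 2012-11-10
```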

 

Transformer "String: Datetime"

Parses the candidate value for valid datetimes and returns them in the format yyyy-mm-ddThh:mm:ss.
e.g. 10.12.2021 10:17 -> 2021-10-12T10:17:00
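The output format can be illustrated with Python's `datetime`. In the example above the first number ends up as the month, so this sketch assumes a month-first input format. That reading is inferred from the example only, and `parse_datetime` and its `input_format` parameter are hypothetical.

```python
from datetime import datetime

def parse_datetime(value: str, input_format: str = "%m.%d.%Y %H:%M") -> str:
    # Assumption: month.day.year followed by hours:minutes.
    # isoformat() yields yyyy-mm-ddThh:mm:ss when microseconds are zero.
    return datetime.strptime(value, input_format).isoformat()

print(parse_datetime("10.12.2021 10:17"))  # 2021-10-12T10:17:00
```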

 

Transformer "Number: Only Numbers"

Removes everything that is not a number. Whether something counts as a number is decided per token.
e.g. ABC 123 456.11 DEF 78 XX99 -> 123 456.11 78
ABC, DEF and XX99 are removed because they do not consist exclusively of numbers.

 

Transformer "Number: Only Integers"

Removes everything that is not an integer value. Whether something counts as an integer is decided per token.
e.g. ABC 123 456.11 DEF 78 XX99 -> 123 78
ABC, 456.11, DEF and XX99 are removed because they are not integers.

Transformer "Number: Only Floats"

Removes everything that is not a float value. Whether something counts as a float is decided per token.
e.g. ABC 123 456.11 DEF 78 XX99 -> 456.11
ABC, 123, DEF, 78 and XX99 are removed because they are not float values.
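The per-token check shared by "Only Numbers", "Only Integers" and "Only Floats" can be sketched as below. The sketch assumes that tokens are separated by whitespace and that a float is written with a period as the decimal separator; all names are hypothetical.

```python
import re

INTEGER = re.compile(r"\d+")     # digits only
FLOAT = re.compile(r"\d+\.\d+")  # digits, a period, digits

def keep_tokens(value: str, *patterns: re.Pattern) -> str:
    # Keep a token only if it matches one of the patterns in full.
    kept = [t for t in value.split() if any(p.fullmatch(t) for p in patterns)]
    return " ".join(kept)

def only_numbers(value: str) -> str:
    return keep_tokens(value, INTEGER, FLOAT)

def only_integers(value: str) -> str:
    return keep_tokens(value, INTEGER)

def only_floats(value: str) -> str:
    return keep_tokens(value, FLOAT)

sample = "ABC 123 456.11 DEF 78 XX99"
print(only_numbers(sample))   # 123 456.11 78
print(only_integers(sample))  # 123 78
print(only_floats(sample))    # 456.11
```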

Transformer "Number: Money Amount"

Removes common currency abbreviations and symbols, and handles thousands separators.
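Only useful if the candidate prediction is a price string such as `CHF 14.05.-` or `-10.10 $`. The value is always interpreted with two decimal places, so the last two digits become the decimal places regardless of where separators appear.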
e.g. CHF 14.05.- -> -14.05
14.4 -> 1.44
14.4 5 7 -> 144.57
14.4 X 2 -> None
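The following sketch reproduces the examples above under a set of assumptions, not the product's actual rules. It assumes that a short, hypothetical list of currency tokens is removed, that any minus sign marks the value as negative, that spaces, periods, commas, and apostrophes are treated as separators, and that any other leftover character makes the result `None`.

```python
import re
from decimal import Decimal
from typing import Optional

# Hypothetical currency list; the product's real list is not documented here.
CURRENCY = re.compile(r"\b(?:CHF|USD|EURO|EUR)\b|[$€£]", re.IGNORECASE)
SEPARATORS = re.compile(r"[\s.,']")

def money_amount(value: str) -> Optional[Decimal]:
    text = CURRENCY.sub("", value)
    negative = "-" in text  # assumption: any minus sign means negative
    digits = SEPARATORS.sub("", text.replace("-", ""))
    if not re.fullmatch(r"[0-9]+", digits):
        return None  # leftover characters that are not part of a price
    amount = Decimal(int(digits)).scaleb(-2)  # last two digits are decimals
    return -amount if negative else amount

for raw in ["CHF 14.05.-", "14.4", "14.4 5 7", "14.4 X 2", "17'500.95"]:
    print(raw, "->", money_amount(raw))
```

`Decimal` is used instead of `float` so that amounts such as 17500.95 are represented exactly.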