How to transform the data extracted by the extractors
Transformers support downstream processing of the preliminary extraction results coming from the extractors. They let the user apply basic transformations to the extracted values. For example, the "Money Amount" transformer removes thousands separators, currency names and similar noise, so that all of the following strings in a document are transformed to the same decimal number: `17'500.95`, `17,500.95`, `17 500.95`, `17500.95`, `USD 17500.95`, `17500.95 Euro` `-->` `17500.95`.
Several transformers can be configured; they are processed in sequence, and all of them run before the verifiers.
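Sequential processing can be pictured as a simple chain. The following is a minimal sketch, not the product's implementation: the `Transformer` type and the `apply_transformers` function are hypothetical names, and it assumes that each transformer receives the result of the previous one and that a `None` result stops the chain.

```python
from typing import Callable, Iterable, Optional

# Hypothetical type: a transformer maps a candidate string to a new value,
# or to None when the value cannot be transformed.
Transformer = Callable[[str], Optional[str]]


def apply_transformers(value: str, transformers: Iterable[Transformer]) -> Optional[str]:
    """Apply each transformer in order, feeding each result into the next one."""
    result: Optional[str] = value
    for transform in transformers:
        if result is None:  # assumption: a rejected value ends the chain
            break
        result = transform(result)
    return result


# Two stand-in steps from the standard library: trim, then upper-case.
print(apply_transformers("  abc ", [str.strip, str.upper]))  # ABC
```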
Transformers marked with an asterisk (*) will follow in the next release.
| Transformer | Description |
| --- | --- |
| String: Strip String | Removes whitespace |
| String: Search Regex | Searches the candidate with a regular expression to extract the value |
| String: Substitute | Searches for a value using a regex and replaces it with a fixed value |
| String: Only Characters* | Concatenates only the characters in the prediction and strips off the digits. Hint: use the transformer "String: Substitute" instead (`[0-9]` -> nothing) |
| String: Date | Parses the prediction string in the candidates to return valid dates |
| String: Datetime | Parses the candidate value for valid datetimes and returns them in the format yyyy-mm-ddThh:mm:ss |
| Number: percent to decimal* | Converts a percentage value to a decimal value, e.g. 72% -> 0.72 |
| Number: Only Numbers | Removes everything that is not a number |
| Number: Only Integers | Removes everything that is not an integer value |
| Number: Only Floats | Removes everything that is not a decimal value |
| Number: Money Amount | Removes common currency abbreviations and symbols, and handles thousands separators |
Transformer "String: Strip String"
Removes all whitespace, including spaces inside the value.
**Example:** `CH03 4545 4545 4545 4545 1` `->` `CH0345454545454545451`
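As a rough illustration of the example above, the same effect can be sketched in a few lines of Python. The function name `strip_string` is hypothetical, and the sketch assumes that every kind of whitespace (spaces, tabs, line breaks) is removed.

```python
def strip_string(value: str) -> str:
    """Remove every whitespace character, including those inside the value."""
    # str.split() without arguments splits on any run of whitespace.
    return "".join(value.split())


print(strip_string("CH03 4545 4545 4545 4545 1"))  # CH0345454545454545451
```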
Transformer "String: Search Regex"
Searches the text of the candidate for the given regex pattern. If the pattern matches, the first matching substring is returned; otherwise the full text is retained.
| Parameter | Description |
| --- | --- |
| Pattern | Search string (regex) |
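The described behaviour (first match, or the full text when nothing matches) can be sketched with Python's `re` module. This is an illustration only: the function name `search_regex` and the sample input are made up, and the product's regex dialect may differ from Python's.

```python
import re


def search_regex(value: str, pattern: str) -> str:
    """Return the first match of `pattern` in `value`, or `value` unchanged."""
    match = re.search(pattern, value)
    return match.group(0) if match else value


# Hypothetical candidate text, used only to show both outcomes.
print(search_regex("Invoice no. 2021-0042 dated 10.12.2021", r"\d{4}-\d{4}"))  # 2021-0042
print(search_regex("no digits here", r"\d+"))  # no digits here
```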
Transformer "String: Substitute"
Searches for a value using a regex and replaces it with a fixed value. The rest of the text is left untouched, e.g. DIESELMOTOR -> DieselMOTOR when DIESEL is replaced with Diesel.
In principle it is also possible to reorder parts of the match with regex backreferences, provided the search expression contains groups in round brackets. `\1` returns the content of the first pair of round brackets. For example, the search string `(ABC)(DEF)(GHI)` with the replacement `\3\1` matches ABCDEFGHI and yields GHIABC.
| Parameter | Description |
| --- | --- |
| Search for (string) | Search string (regex) |
| Replace with (string) | Replacement value |
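Both cases, the plain replacement and the reordering with backreferences, can be reproduced with Python's `re.sub`. The sketch below is illustrative: `substitute` is a hypothetical name, and it assumes that every occurrence of the search pattern is replaced, which the description above does not state.

```python
import re


def substitute(value: str, search_for: str, replace_with: str) -> str:
    """Replace every match of `search_for` in `value` with `replace_with`."""
    # Assumption: all occurrences are replaced, not only the first one.
    return re.sub(search_for, replace_with, value)


print(substitute("DIESELMOTOR", "DIESEL", "Diesel"))          # DieselMOTOR
print(substitute("ABCDEFGHI", r"(ABC)(DEF)(GHI)", r"\3\1"))   # GHIABC
```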
Transformer "String: Date"
Parses the prediction string in the candidates to return valid dates. A date is always divided into three parts (day, month and year). The "DATE_ORDER" parameter of this transformer defines the order in which these parts are interpreted.
10.11.12 using the date order YMD -> 2010-11-12
10.11.12 using the date order DMY -> 2012-11-10
| Parameter | Description |
| --- | --- |
| date order | Order of the date parts: DMY, YMD, ... |
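The effect of the date order on an ambiguous input such as `10.11.12` can be sketched as follows. This is a simplified illustration, not the transformer's actual parser: `parse_date` is a hypothetical name, only purely numeric dates are handled, and two-digit years are assumed to belong to the 2000s, as in the two examples above.

```python
import re
from datetime import date


def parse_date(value: str, date_order: str = "DMY") -> str:
    """Interpret the three numeric parts of `value` according to `date_order`."""
    parts = re.findall(r"\d+", value)
    if len(parts) != 3 or sorted(date_order) != ["D", "M", "Y"]:
        raise ValueError("expected three numeric parts and an order such as DMY")
    # Pair each letter of the order with the part at the same position.
    fields = dict(zip(date_order, map(int, parts)))
    year = fields["Y"]
    if year < 100:
        year += 2000  # assumption: two-digit years mean 20xx
    return date(year, fields["M"], fields["D"]).isoformat()


print(parse_date("10.11.12", "YMD"))  # 2010-11-12
print(parse_date("10.11.12", "DMY"))  # 2012-11-10
```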
Transformer "String: Datetime"
Parses the candidate value for valid datetimes and returns them in the format `yyyy-mm-ddThh:mm:ss`.
e.g. 10.12.2021 10:17 -> 2021-10-12T10:17:00
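Note that the example reads `10.12.2021` month first (12 October). The sketch below reproduces exactly this case with Python's `datetime`. It assumes the month-first reading and a single fixed input layout; the function name `parse_datetime` is hypothetical, and the real transformer presumably accepts many more formats.

```python
from datetime import datetime


def parse_datetime(value: str) -> str:
    """Parse 'mm.dd.yyyy hh:mm' and return it as yyyy-mm-ddThh:mm:ss."""
    # Assumption: month comes first, as in the documented example.
    parsed = datetime.strptime(value, "%m.%d.%Y %H:%M")
    return parsed.isoformat()


print(parse_datetime("10.12.2021 10:17"))  # 2021-10-12T10:17:00
```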
Transformer "Number: Only Numbers"
Removes everything that is not a number. Whether a value counts as a number is decided per token.
e.g. ABC 123 456.11 DEF 78 XX99 -> 123 456.11 78
ABC, DEF and XX99 are deleted because these do not consist exclusively of numbers.
Transformer "Number: Only Integers"
Removes everything that is not an integer value. Whether a value counts as an integer is decided per token.
e.g. ABC 123 456.11 DEF 78 XX99 -> 123 78
ABC, 456.11, DEF and XX99 are deleted because these tokens are not integer values.
Transformer "Number: Only Floats"
Removes everything that is not a float value. Whether a value counts as a float is decided per token.
e.g. ABC 123 456.11 DEF 78 XX99 -> 456.11
ABC, 123, DEF, 78 and XX99 are deleted because these tokens are not float values.
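The per-token decision shared by "Only Numbers", "Only Integers" and "Only Floats" can be sketched with one helper. This is an illustration under assumptions: tokens are taken to be separated by whitespace, a float is taken to be digits with a single decimal point, and the names `INTEGER`, `FLOAT` and `keep_tokens` are hypothetical.

```python
import re

INTEGER = re.compile(r"\d+")       # assumption: digits only
FLOAT = re.compile(r"\d+\.\d+")    # assumption: digits, one dot, digits


def keep_tokens(value: str, *patterns: re.Pattern) -> str:
    """Keep only the whitespace-separated tokens that fully match a pattern."""
    kept = [tok for tok in value.split() if any(p.fullmatch(tok) for p in patterns)]
    return " ".join(kept)


text = "ABC 123 456.11 DEF 78 XX99"
print(keep_tokens(text, INTEGER, FLOAT))  # Only Numbers:  123 456.11 78
print(keep_tokens(text, INTEGER))         # Only Integers: 123 78
print(keep_tokens(text, FLOAT))           # Only Floats:   456.11
```

A token such as XX99 is dropped in all three cases because `fullmatch` requires the whole token to match, not just part of it.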
Transformer "Number: Money Amount"
Removes common currency abbreviations and symbols, and handles thousands separators.
Only useful if the candidate prediction is a price string, such as `CHF 14.05.-` or `-10.10 $`. The value is always interpreted with two decimal places.
e.g. CHF 14.05.- -> -14.05
14.4 -> 1.44
14.4 5 7 -> 144.57
14.4 X 2 -> None
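One way to read the examples above: currency markers are removed, every separator is dropped, and the last two remaining digits become the decimal places. The sketch below follows that reading. It is inferred from the examples and is not the product's implementation; the currency list, the treatment of any `-` as a negative sign, and the name `money_amount` are assumptions.

```python
import re
from decimal import Decimal
from typing import Optional

# Assumption: a small, illustrative set of currency names and symbols.
CURRENCY = re.compile(r"\b(?:CHF|USD|EUR|Euro)\b|[$€£]", re.IGNORECASE)
SEPARATORS = re.compile(r"[\s'.,-]")


def money_amount(value: str) -> Optional[Decimal]:
    """Return the amount with two decimal places, or None if it is not a price."""
    text = CURRENCY.sub("", value)
    negative = "-" in text  # assumption: any minus sign makes the amount negative
    digits = SEPARATORS.sub("", text)
    if not re.fullmatch(r"[0-9]+", digits):
        return None  # leftover characters such as "X" mean: not a price
    amount = Decimal(digits) / 100  # last two digits are the decimal places
    return -amount if negative else amount


for raw in ["CHF 14.05.-", "14.4", "14.4 5 7", "14.4 X 2", "17'500.95"]:
    print(raw, "->", money_amount(raw))
```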