These best practices are guidelines and recommendations for configuring document types to ensure good extraction quality.
Preparation
Overview of layouts and variation
The first step when setting up a new document type is to familiarize yourself with all possible layouts and variations of the document type. It is essential to know every location and format in which the information to be extracted can appear.
Requirements definition
Define all fields that the document type needs upfront. Adding new fields later in the process can require additional effort to ensure proper training progress.
Compiling training and testing documents
Before starting the configuration, prepare a set of training and testing documents that covers as many document layouts and variations as possible to ensure diverse, high-quality ground truth. The set should consist of at least 200 documents for simple document types and at least 500 for more complex types.
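One way to keep every layout represented in both sets is to split the collected documents per layout rather than across the whole pile. The following is a minimal sketch, not a Parashift feature: the function name, the (document ID, layout label) input format, and the 20% testing share are assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_by_layout(documents, test_fraction=0.2, seed=0):
    """Split (doc_id, layout) pairs into training and testing lists, layout by layout.

    test_fraction is an assumed value; the best practices do not prescribe a ratio.
    """
    by_layout = defaultdict(list)
    for doc_id, layout in documents:
        by_layout[layout].append(doc_id)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    training, testing = [], []
    for layout in sorted(by_layout):
        doc_ids = by_layout[layout]
        rng.shuffle(doc_ids)
        # Hold out at least one document per layout, unless the layout has only one.
        if len(doc_ids) > 1:
            n_test = max(1, round(len(doc_ids) * test_fraction))
        else:
            n_test = 0
        testing.extend(doc_ids[:n_test])
        training.extend(doc_ids[n_test:])
    return training, testing
```

A layout that appears only once ends up in the training list alone, which is a hint that more samples of that layout should be collected.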
Configuration
Checking for standard document types
Before configuring a new document type, the standard document types must be checked and evaluated to determine whether they can be used.
Checking for standard fields and field sets
In cases where no standard document type is suitable, the available standard fields and field sets must be evaluated.
If in doubt, use ML-based extractors
When configuring a field, if the ideal extractor choice is unclear, choose a machine-learning extractor.
Training and testing
Initial testing
Before starting the training of a document type, upload and validate at least one document for every possible variation. This helps to spot potential problems early, before they arise while validating the training documents.
Training
The required size of the training set depends on the complexity of the document type. For simple document types, the set should include at least 200 documents; for more complex document types, at least 500 documents.
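The minimum sizes above can be turned into a quick check before training starts. This is a small sketch; the "simple" and "complex" labels and the function name are hypothetical, and deciding which category a document type falls into remains a manual judgment.

```python
# Recommended minimum training set sizes from the guideline above.
MIN_TRAINING_DOCUMENTS = {"simple": 200, "complex": 500}


def missing_training_documents(document_count, complexity):
    """Return how many more documents are needed to reach the recommended minimum."""
    minimum = MIN_TRAINING_DOCUMENTS[complexity]
    return max(0, minimum - document_count)
```

For example, calling missing_training_documents(150, "simple") reports that 50 more documents are needed, while a complex document type with 650 documents already meets the recommendation.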
Testing and benchmarking
During the training, the extraction quality must be assessed. To get an accurate picture of the quality, a benchmark can be requested from Parashift support.
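Between benchmarks, a rough local estimate of extraction quality can be computed from validated test documents. The sketch below is not the Parashift benchmark and does not reproduce its method: it assumes that extracted and expected values are available as plain strings, counts a field as correct only on an exact match, and uses hypothetical field names.

```python
from collections import defaultdict


def field_accuracy(results):
    """Compute the share of exact matches per field.

    results: iterable of (field_name, extracted_value, expected_value) tuples.
    Exact string equality is an assumed, deliberately strict matching rule.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for field, extracted, expected in results:
        total[field] += 1
        if extracted == expected:
            correct[field] += 1
    return {field: correct[field] / total[field] for field in total}


# Hypothetical sample: two documents for one field, one document for another.
sample = [
    ("invoice_number", "A-1", "A-1"),
    ("invoice_number", "A-2", "A-3"),
    ("total", "10.00", "10.00"),
]
accuracy = field_accuracy(sample)
```

Fields with a noticeably lower share of matches than the others are good candidates for a closer look at their training documents.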