Best practices configuration

These best practices are guidelines and recommendations for configuring document types in a way that ensures good extraction quality.

Preparation

Overview of layouts and variations

The first step when setting up a new document type is familiarizing yourself with all the possible layouts and variations of the document type. It is essential to know all possible locations and formats of the information that needs to be extracted.

Requirements definition

Define all fields that the document type must include upfront. Adding new fields later in the process can require additional effort to ensure proper training progress.

Compiling training and testing documents

Before the configuration, it is essential to prepare a training and testing set of documents, which includes as many document layouts and variations as possible, to ensure diverse, high-quality ground truth. The set should consist of at least 200 documents for simple document types and 500 for more complex types.
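The size and coverage guideline above can be checked with a small script before the configuration starts. The following is a minimal sketch, not part of the product: it assumes each collected document has been given a layout label by hand, and the function name, the labels, and the report keys are hypothetical. Only the thresholds of 200 and 500 documents come from the text above.

```python
from collections import Counter

# Minimum set sizes taken from the guideline above.
MIN_DOCS = {"simple": 200, "complex": 500}


def check_document_set(layouts, complexity, known_layouts):
    """Report whether a document set meets the size guideline and covers all layouts.

    layouts: one layout label per collected document (hypothetical labels).
    complexity: "simple" or "complex".
    known_layouts: every layout identified during the layout overview.
    """
    counts = Counter(layouts)
    # Layouts that were identified but have no document in the set yet.
    missing = sorted(set(known_layouts) - set(counts))
    return {
        "total": len(layouts),
        "required": MIN_DOCS[complexity],
        "enough_documents": len(layouts) >= MIN_DOCS[complexity],
        "missing_layouts": missing,
        "per_layout": dict(counts),
    }
```

A report with a non-empty `missing_layouts` list or `enough_documents` set to `False` suggests that more documents should be collected before training begins.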

Configuration

Checking for standard document types

Before configuring a new document type, the standard document types must be checked and evaluated to determine whether they can be used.

Checking for standard fields and field sets

In cases where no standard document type is suitable, the available standard fields and field sets must be evaluated.

If in doubt, use ML-based extractors

When configuring a field, if it is unclear which extractor is the best choice, choose a machine-learning extractor.

Training and testing

Initial testing

Before starting the training on a document type, upload and validate one document for every possible variation. This helps spot potential problems early, before they arise while validating the training documents.
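Choosing one document per variation for this initial test can be sketched as follows. This is an illustration only: it assumes the documents are available as pairs of file name and a manually assigned layout label, and the function name and labels are hypothetical.

```python
def pick_initial_test_documents(documents):
    """Return the first document found for each layout variation.

    documents: iterable of (document_name, layout_label) pairs.
    """
    picked = {}
    for name, layout in documents:
        # Keep only the first document seen for each layout.
        picked.setdefault(layout, name)
    return picked
```

The resulting mapping lists exactly one document per layout, which can then be uploaded and validated by hand.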

Training

The required size of the training set depends on the complexity of the document type. For simple document types, the set should include at least 200 documents; for more complex document types, it should contain at least 500 documents.

Testing and benchmarking

During the training, the extraction quality has to be assessed. To get an accurate picture of the quality, a benchmark can be requested through Parashift support.
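Between official benchmarks, a rough local check of extraction quality can be useful. The sketch below computes per-field exact-match accuracy against validated ground truth. It is not the benchmark provided by support and makes several assumptions: extracted and validated values are available as plain dictionaries keyed by field name, the two lists are aligned by document, and the function and field names are hypothetical.

```python
def field_accuracy(predictions, ground_truth):
    """Compute per-field exact-match accuracy.

    predictions, ground_truth: lists of dicts mapping field name to value,
    aligned so that index i in both lists refers to the same document.
    """
    correct = {}
    total = {}
    for pred, truth in zip(predictions, ground_truth):
        for field, expected in truth.items():
            total[field] = total.get(field, 0) + 1
            # A missing or different extracted value counts as an error.
            if pred.get(field) == expected:
                correct[field] = correct.get(field, 0) + 1
    return {field: correct.get(field, 0) / total[field] for field in total}
```

Fields with noticeably lower accuracy than the rest are candidates for a closer look at their extractor configuration or at the layouts represented in the training set.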