How to validate correctly
Document validation directly affects how accurately the machine learning model trains. Consistency is key: marking fields and data the same way throughout a document helps the system learn and reduces errors in future extractions. Inconsistent marking can cause data to be processed or classified incorrectly.
Important: Before you start validating, familiarize yourself with the Highlighting and Document Viewer features. Both are essential for accurate validation and effective machine learning model training.
General Tips for Effective Validation
To maintain consistency, start validating at the first few pages or top sections of the document. This is not a strict requirement, but marking the first pages properly sets a pattern that the rest of the document can follow.
Make sure to validate the entire section or table, not just the first few lines. Partial validation, such as marking only the first few rows in a table, can lead to incomplete training, which in turn may affect future document extractions.
Remember: Confirm the entire section even if everything appears correct. A single missed field in a table or document can exclude that part of the data from the training process, which reduces the accuracy of the system's learning.
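The completeness check described above can be sketched in a few lines of Python. This is only an illustration: the row structure, the column names, and the function `find_unvalidated_cells` are hypothetical and do not correspond to any export format or API of the product.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical representation: each table row maps a column name to the
# validated value, or None (or a missing key) when the cell is not validated.
Row = Dict[str, Optional[str]]


def find_unvalidated_cells(rows: List[Row], columns: List[str]) -> List[Tuple[int, str]]:
    """Return (row_index, column) pairs for every cell that is not validated."""
    missing = []
    for index, row in enumerate(rows):
        for column in columns:
            if row.get(column) is None:
                missing.append((index, column))
    return missing


columns = ["description", "quantity", "unit_price"]
rows = [
    {"description": "Widget", "quantity": "2", "unit_price": "4.50"},
    {"description": "Gadget", "quantity": "1", "unit_price": None},
    {"description": "Bolt"},  # only the first column was marked
]

gaps = find_unvalidated_cells(rows, columns)
print(gaps)  # [(1, 'unit_price'), (2, 'quantity'), (2, 'unit_price')]
```

An empty result would mean that every cell in the table has been validated, which is the state to aim for before moving on to the next section.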
OCR Errors and Confidence Warnings
OCR (Optical Character Recognition) errors can sometimes lead to discrepancies in data extraction. If the system fails to recognize a value, it may learn the position of the value on the document rather than the actual content. It’s important to be aware of this distinction during validation to ensure the system learns the correct patterns.
Recognition confidence and prediction confidence can also differ. Low confidence signals that the system is unsure about an extracted value, while high confidence usually means the value is more reliable. Knowing where the system is unsure shows you which values to check most closely during validation.
This topic could be explored further in another article, but understanding the basics here will improve the quality of your validation.
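As a rough illustration of how confidence values can guide a review, the sketch below sorts extracted values so that the least certain ones come first. It assumes each value carries two scores between 0 and 1. The class `ExtractedValue`, its attribute names, and the threshold of 0.8 are all hypothetical choices made for this example, not documented settings.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ExtractedValue:
    # All names here are hypothetical and used only for illustration.
    field: str
    text: str
    recognition_confidence: float  # assumed range 0.0 to 1.0
    prediction_confidence: float   # assumed range 0.0 to 1.0


def lowest_confidence(value: ExtractedValue) -> float:
    """Return the weaker of the two scores for a value."""
    return min(value.recognition_confidence, value.prediction_confidence)


def review_queue(values: List[ExtractedValue], threshold: float = 0.8) -> List[str]:
    """Return field names whose weaker score is below the threshold, least certain first."""
    flagged = [v for v in values if lowest_confidence(v) < threshold]
    flagged.sort(key=lowest_confidence)
    return [v.field for v in flagged]


values = [
    ExtractedValue("invoice_number", "INV-1001", 0.98, 0.95),
    ExtractedValue("total_amount", "1,2O0.00", 0.55, 0.90),  # letter O misread as zero
    ExtractedValue("due_date", "2024-03-01", 0.97, 0.70),
]

queue = review_queue(values)
print(queue)  # ['total_amount', 'due_date']
```

Taking the weaker of the two scores is a deliberately cautious choice: a value is flagged whether the doubt comes from recognition or from prediction.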
Field-Specific Validation
Different types of fields require specific validation techniques to ensure that data is extracted correctly. Here are some key guidelines:
- Addresses: Keep the entire address together when validating address fields; do not split the top and bottom parts. Capture all values from a single address block rather than combining values from different blocks. Marking addresses consistently helps the system learn correct address structures.
- Line Items: For tables with multiple line items (like purchase orders or invoices), each line should be validated independently. This ensures that similar data is captured accurately across different pages or sections of the document.
- Amounts (Repeatable Fields): Amount fields, particularly those that repeat across a document (such as totals or unit prices), should be validated carefully. Make sure that every instance of an amount is correctly marked, as repeated amounts across pages need to be recognized properly by the system.
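The address guideline in the list above (capture values from one address block) lends itself to a simple consistency check. The sketch below assumes that each marked part of an address can be tagged with the block it came from. The `MarkedPart` record and the `block_id` values are hypothetical and serve only to illustrate the idea.

```python
from typing import List, NamedTuple


class MarkedPart(NamedTuple):
    # Hypothetical record of one marked piece of an address.
    field: str     # for example "street" or "city"
    block_id: str  # assumed identifier of the address block it was marked in


def blocks_used(parts: List[MarkedPart]) -> List[str]:
    """Return the distinct address blocks the marked parts come from, sorted."""
    return sorted({part.block_id for part in parts})


def is_single_block(parts: List[MarkedPart]) -> bool:
    """True when every marked part belongs to the same address block."""
    return len(blocks_used(parts)) <= 1


consistent = [
    MarkedPart("street", "supplier"),
    MarkedPart("postal_code", "supplier"),
    MarkedPart("city", "supplier"),
]
mixed = [
    MarkedPart("street", "supplier"),
    MarkedPart("city", "delivery"),
]

print(is_single_block(consistent))  # True
print(is_single_block(mixed))       # False
print(blocks_used(mixed))           # ['delivery', 'supplier']
```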
Highlighting and Document Viewer
The Aggressive Highlighting mode is often the more effective choice for validation. It helps identify discrepancies in the recognition process, which is especially useful for spotting subtle errors that are not immediately obvious. Be aware that this mode displays a higher level of detail, which can sometimes be overwhelming.