How is an uploaded document or batch processed in the Parashift Platform? What steps (Separation, Classification, Extraction) are there and what do they do? How are these workflow steps represented in our data model?
Every uploaded document and batch runs through the same workflow steps. Depending on the use case, some steps can be turned off, while others can be very extensive.
The following short high-level diagram explains the different workflow steps.
Users usually don't interact with this workflow step. It is a purely technical step that prepares incoming files for further processing. Files are converted into an internal working format and enhanced to remove skewing, rotation or background from the image.
One of the most important tasks in this workflow step is Optical Character Recognition (OCR), which transforms any text on the images into machine-readable text.
If you only work with documents that are already properly separated (e.g. you only upload PDF files that each contain a single document), this step can be skipped.
In this step, a batch, which is a collection of multiple documents uploaded together, is separated into single documents for further processing. Separation can be done purely manually by a user, but ideally it is automated using different methods.
Separation can also be revisited at a later stage in the process if a user sees that a document is not properly separated and needs further splitting.
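As a rough illustration of what separation does, the following Python sketch splits a batch, modeled here simply as an ordered list of pages, into documents at given start positions. The function name `split_batch` and the list-of-pages model are assumptions made for this example; they are not part of the Parashift data model or API.

```python
from typing import List, Sequence


def split_batch(pages: Sequence[str], first_pages: Sequence[int]) -> List[List[str]]:
    """Split an ordered list of pages into documents.

    `first_pages` holds the zero-based indices of pages that start a new
    document. Index 0 always starts a document; out-of-range indices are ignored.
    """
    if not pages:
        return []
    starts = sorted({0, *(i for i in first_pages if 0 < i < len(pages))})
    ends = starts[1:] + [len(pages)]
    return [list(pages[s:e]) for s, e in zip(starts, ends)]


batch = ["p1", "p2", "p3", "p4", "p5"]
documents = split_batch(batch, [2])

# Revisiting separation later: the second document is split once more.
resplit = split_batch(documents[1], [1])
```

Because a separated document is again just a list of pages, further splitting at a later stage can reuse the same function.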
If you already know which document type you are going to process, this step can be skipped.
In the classification step, single documents are classified into different document types. Users can interact with exceptions that could not be classified automatically.
For most of our users, this is the most important step in the whole process. As always, it can also be skipped if you only want properly separated and classified documents.
Depending on the document type and the configured fields, we extract data from the document. Users can interact with the document, capture fields manually or validate fields that had low prediction confidence.
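The idea of validating only fields with low prediction confidence can be sketched as follows. This is a minimal illustration, not the platform's implementation: the `ExtractedField` class, the field names, the 0.0 to 1.0 confidence scale and the threshold of 0.8 are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExtractedField:
    name: str
    value: Optional[str]
    confidence: float  # prediction confidence between 0.0 and 1.0 (assumed scale)


def needs_validation(fields: List[ExtractedField], threshold: float = 0.8) -> List[str]:
    """Return the names of fields a user should capture or validate.

    A field qualifies if no value was extracted or if its confidence is
    below the (hypothetical) threshold.
    """
    return [f.name for f in fields if f.value is None or f.confidence < threshold]


fields = [
    ExtractedField("invoice_number", "2024-001", 0.97),
    ExtractedField("total_amount", "118.50", 0.55),
    ExtractedField("due_date", None, 0.0),
]
to_review = needs_validation(fields)
```

In this example, `invoice_number` passes without user interaction, while `total_amount` (low confidence) and `due_date` (no value) are left for manual validation or capture.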
Many configuration options are available, and many standard document types and fields are ready to use out of the box, so you can extract exactly the data points you need.
The following three attributes always show the current status of a document and which of the above-mentioned workflow steps it is currently in.
|overall document status, mostly either in_progress or done|
|current workflow step|
|current status of the current workflow_step|
Examples of the most important workflow & status combinations
|done||done||done||The document is completely processed, and all data can be fetched, including export files, document type and of course field data|
|in_progress||The document is waiting for manual interaction by a user. Data can already be fetched but may change during validation (user interaction)|
|The processing has failed, check the uploaded file for correctness or contact Parashift Support|
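A client could interpret the three attributes along the lines of the sketch below. Only `workflow_step` and the values `done` and `in_progress` appear in this article; the attribute names `status` and `workflow_status`, the step name `extraction`, and the values `pending` and `failed` are placeholders assumed for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentState:
    status: str           # overall document status (assumed attribute name)
    workflow_step: str    # current workflow step
    workflow_status: str  # status of the current step (assumed attribute name)


def describe(state: DocumentState) -> str:
    """Map a status combination to a short hint for the client."""
    if state.status == "done":
        return "completely processed: all data can be fetched"
    if state.status == "in_progress":
        return "still in the workflow: data can be fetched but may change"
    # Any other overall status is treated as a problem in this sketch.
    return "not processed: check the uploaded file or contact support"


finished = DocumentState("done", "done", "done")
waiting = DocumentState("in_progress", "extraction", "pending")  # placeholder values
broken = DocumentState("failed", "extraction", "failed")         # placeholder values
```

Checking the overall status first keeps the client logic simple; the other two attributes are only needed to find out where in the workflow an unfinished document currently is.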
Example API Calls
Filter for done documents
Filter for done documents that were not yet exported
The resulting list is then often processed (e.g. field data is fetched for each document),
and the documents are then marked as exported, which excludes them from the original query.
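The query, process and mark-as-exported cycle can be sketched on an in-memory list of documents. This example makes no API calls; the dictionary keys `id`, `status` and `exported`, as well as the function names, are assumptions made for illustration and do not reflect the actual request or response format.

```python
from typing import Callable, Dict, List

Document = Dict[str, object]


def pending_export(documents: List[Document]) -> List[Document]:
    """Select documents that are done and not yet marked as exported."""
    return [d for d in documents
            if d["status"] == "done" and not d.get("exported", False)]


def export_all(documents: List[Document], handler: Callable[[Document], None]) -> List[object]:
    """Process each pending document, then mark it as exported."""
    exported_ids = []
    for doc in pending_export(documents):
        handler(doc)            # e.g. fetch and store the field data
        doc["exported"] = True  # excludes the document from the next query
        exported_ids.append(doc["id"])
    return exported_ids


documents = [
    {"id": "1", "status": "done"},
    {"id": "2", "status": "in_progress"},
    {"id": "3", "status": "done", "exported": True},
]
processed: List[object] = []
first_run = export_all(documents, lambda d: processed.append(d["id"]))
second_run = export_all(documents, lambda d: processed.append(d["id"]))
```

Marking a document only after its handler has run means that a document whose processing raises an error stays in the pending list and is picked up again by the next query.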
Count documents awaiting Extraction Validation
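Counting documents per workflow position can be sketched locally as shown below. The sketch counts an already fetched list instead of calling the API; the key `workflow_status` and the values `extraction`, `classification` and `pending` are placeholders assumed for this example, since the exact values are not shown here.

```python
from collections import Counter
from typing import Dict, Iterable


def count_by_step(documents: Iterable[Dict[str, str]]) -> Counter:
    """Count documents per (workflow_step, workflow_status) combination."""
    return Counter((d["workflow_step"], d["workflow_status"]) for d in documents)


documents = [
    {"id": "1", "workflow_step": "extraction", "workflow_status": "pending"},
    {"id": "2", "workflow_step": "extraction", "workflow_status": "pending"},
    {"id": "3", "workflow_step": "classification", "workflow_status": "pending"},
    {"id": "4", "workflow_step": "done", "workflow_status": "done"},
]
counts = count_by_step(documents)

# Placeholder combination standing in for "awaiting Extraction Validation".
awaiting_validation = counts[("extraction", "pending")]
```

Counting all combinations at once gives the numbers for every workflow step in a single pass, not just for extraction.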
I strongly recommend the following articles. They go into detail about how a batch relates to documents, pages and especially input_files, about the batch schema, and about the relation between documents and fields. They also explain how to upload documents so that workflow steps are skipped, and more.
- Relationships & Structure of Batches, Documents, Pages and Files
- Relationships & Structure of Documents & Fields
- Basics: Upload a Document or Batch
Also, check out our Postman API Documentation