How is an uploaded document or batch processed in the Parashift Platform? What steps (Separation, Classification, Extraction) are there and what do they do? How are these workflow steps represented in our data model?
Every uploaded document and batch runs through the same workflow steps. Depending on the use case, some steps can be turned off, while others can be very extensive.
Workflow Description
The following short, high-level diagram explains the different workflow steps.
Inbound
Users usually don't interact with this workflow step. It is a purely technical step that prepares incoming files for further processing. Files are converted into an internal working format and enhanced to remove skewing, rotation or background from the image.
One of the most important tasks in this workflow step is Optical Character Recognition (OCR), which transforms any text on the images into machine-readable text.
Separation
If you only work with documents that are already properly separated (e.g. you only upload PDF files that each contain a single document), this step can be skipped.
In this step, a batch (a collection of multiple documents uploaded together) is separated into single documents for further processing. Separation can be done purely manually by a user, but ideally it is automated using different methods.
Separation can also be revisited at a later stage in the process if a user sees that a document was not properly separated and needs further splitting.
Classification
If you already know which document type you are going to process, this step can be skipped.
In the classification step, single documents are classified into different document types. Users can interact with exceptions that could not be classified automatically.
Extraction
For most of our users, this is the most important step in the whole process. Like the other steps, it can be skipped if you only want properly separated and classified documents.
Depending on the document type and the configured fields, we extract data from the document. Users can interact with the document, capture fields manually or validate fields that had low prediction confidence.
There are many configuration options, and many standard document types and fields are available out of the box, so you can extract the data points you need.
Data Model
The following three attributes always show the current status of a document and which of the above-mentioned workflow steps it is currently in.
attributes | allowed values | description |
status | pending, in_progress, done, failed | overall document status, mostly either in_progress or done |
workflow_step | inbound, inbound_processing, ocr, classification, classification_validation, extraction, extraction_validation, outbound_processing, done, qc | current workflow step |
workflow_status | started, in_progress, retry, failed, done | current status of the current workflow_step |
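To make the allowed values concrete, here is a minimal Python sketch that mirrors the three attributes as enums. The class names and the `from_attributes` helper are hypothetical and not part of the Parashift API; only the string values come from the table above. The order of the enum members follows the table and is not meant to imply a processing order.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict


class Status(str, Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    DONE = "done"
    FAILED = "failed"


class WorkflowStep(str, Enum):
    INBOUND = "inbound"
    INBOUND_PROCESSING = "inbound_processing"
    OCR = "ocr"
    CLASSIFICATION = "classification"
    CLASSIFICATION_VALIDATION = "classification_validation"
    EXTRACTION = "extraction"
    EXTRACTION_VALIDATION = "extraction_validation"
    OUTBOUND_PROCESSING = "outbound_processing"
    DONE = "done"
    QC = "qc"


class WorkflowStatus(str, Enum):
    STARTED = "started"
    IN_PROGRESS = "in_progress"
    RETRY = "retry"
    FAILED = "failed"
    DONE = "done"


@dataclass(frozen=True)
class DocumentState:
    """Hypothetical container for the three state attributes of a document."""

    status: Status
    workflow_step: WorkflowStep
    workflow_status: WorkflowStatus

    @classmethod
    def from_attributes(cls, attributes: Dict[str, Any]) -> "DocumentState":
        # Looking an enum up by value raises ValueError for any value
        # that is not listed in the table.
        return cls(
            status=Status(attributes["status"]),
            workflow_step=WorkflowStep(attributes["workflow_step"]),
            workflow_status=WorkflowStatus(attributes["workflow_status"]),
        )
```

Parsing into enums like this means an unexpected value fails loudly at the boundary instead of silently flowing into later processing.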
Examples of the most important workflow & status combinations
status | workflow_step | workflow_status | Description |
done | done | done | The document is completely processed, and all data can be fetched, including export files, document type and field data |
in_progress | classification_validation or extraction_validation | in_progress | The document is waiting for manual interaction by a user. Data can already be fetched but may change with validation (user interaction) |
failed | | | The processing has failed. Check the uploaded file for correctness or contact Parashift Support |
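The combinations above could be checked in client code roughly as follows. This is only a sketch: the function name and the returned labels are hypothetical, and treating every document with status failed as failed, regardless of the other two attributes, is an interpretation of the table rather than documented behaviour.

```python
# Validation steps that the table describes as waiting for a user.
VALIDATION_STEPS = {"classification_validation", "extraction_validation"}


def describe_state(status: str, workflow_step: str, workflow_status: str) -> str:
    """Return a coarse label for a status combination.

    The labels are hypothetical names chosen for this sketch.
    """
    # ASSUMPTION: only `status` is checked for failures, because the table
    # does not list workflow_step or workflow_status for that row.
    if status == "failed":
        return "failed"
    if (status, workflow_step, workflow_status) == ("done", "done", "done"):
        return "completed"
    if (
        status == "in_progress"
        and workflow_step in VALIDATION_STEPS
        and workflow_status == "in_progress"
    ):
        return "awaiting_validation"
    # Any combination that the table does not cover.
    return "other"
```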
Example API Calls
Filter for done documents
https://api.parashift.io/v2/documents?filter[status]=done
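If you build such filter URLs in code, a small helper can keep them consistent. The sketch below uses only the Python standard library; the helper name `documents_url` is hypothetical, and authentication as well as sending the request are left out on purpose.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.parashift.io/v2/documents"


def documents_url(**filters: str) -> str:
    """Build a documents URL with one filter[...] parameter per keyword."""
    if not filters:
        return BASE_URL
    params = {f"filter[{name}]": value for name, value in filters.items()}
    # safe="[]" keeps the square brackets readable instead of
    # percent-encoding them as %5B and %5D.
    query = urlencode(params, safe="[]")
    return f"{BASE_URL}?{query}"


print(documents_url(status="done"))
```

Several filters can be combined by passing more keyword arguments; they appear in the URL in the order in which they are given.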
Filter for done documents that were not yet exported
https://api.parashift.io/v2/documents?filter[exported_at_blank]=true&filter[status]=done
The resulting list is then often processed further, e.g. to fetch the field data of each document:
https://api.parashift.io/v2/documents/123456/?include=document_fields&extra_fields[document_fields]=extraction_candidates
The documents are then marked as exported, which excludes them from the original query:
https://api.parashift.io/v2/documents/123456/mark_as_exported
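The three calls above can be combined into a simple export loop. The following is a sketch under several assumptions, not a reference implementation: the `send` callable is a hypothetical transport that you provide (it performs the authenticated HTTP request and returns the decoded JSON body), the list response is assumed to carry its documents in a top-level `data` list with an `id` per entry, and the HTTP method used for `mark_as_exported` is an assumption that should be checked against the API documentation. Pagination is not handled.

```python
from typing import Any, Callable, List

API = "https://api.parashift.io/v2/documents"

# ASSUMPTION: the HTTP method for mark_as_exported is not stated in this
# article; verify it in the API documentation before use.
MARK_AS_EXPORTED_METHOD = "POST"

# Hypothetical transport signature: (method, url) -> decoded JSON body.
Transport = Callable[[str, str], Any]


def export_done_documents(
    send: Transport, handle: Callable[[Any], None]
) -> List[str]:
    """Fetch done, not yet exported documents, hand each one to `handle`,
    then mark it as exported. Returns the IDs that were processed."""
    listing = send(
        "GET", f"{API}?filter[exported_at_blank]=true&filter[status]=done"
    )
    exported: List[str] = []
    # ASSUMPTION: documents are returned in a top-level "data" list.
    for item in listing.get("data", []):
        doc_id = item["id"]
        detail = send(
            "GET",
            f"{API}/{doc_id}/?include=document_fields"
            "&extra_fields[document_fields]=extraction_candidates",
        )
        handle(detail)
        # Mark as exported only after the document was handled successfully,
        # so a failure leaves it in the original query for the next run.
        send(MARK_AS_EXPORTED_METHOD, f"{API}/{doc_id}/mark_as_exported")
        exported.append(doc_id)
    return exported
```

Because a document is only marked as exported after `handle` returns, an exception in your own processing leaves the document in the result of the original query, so it is picked up again on the next run.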
Count documents awaiting Extraction Validation
https://api.parashift.io/v2/documents?filter[workflow_step]=extraction_validation&filter[workflow_status]=in_progress&stats[total]=count
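Reading the count from the response could look like the sketch below. The response shape is an assumption: the count is expected under `meta`, `stats`, `total`, `count`, which this article does not confirm, so please compare it with an actual response. The function name `total_count` is hypothetical.

```python
from typing import Any, Dict, Optional


def total_count(response: Dict[str, Any]) -> Optional[int]:
    """Return the total count from a decoded JSON response, or None.

    ASSUMPTION: the count is nested as meta -> stats -> total -> count.
    """
    node: Any = response
    for key in ("meta", "stats", "total", "count"):
        # Stop as soon as the expected structure is not present.
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node if isinstance(node, int) else None
```

Returning None instead of raising keeps the helper usable while the exact response shape is still being verified.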
Recommended Reading
I strongly recommend the following articles. They go into detail about how a batch relates to documents, pages and especially input_files, and cover the batch schema as well as the relation between documents and fields. They also explain how to upload documents so that workflow steps are skipped, and more.
- Relationships & Structure of Batches, Documents, Pages and Files
- Relationships & Structure of Documents & Fields
- Basics: Upload a Document or Batch
Also, check out our Postman API Documentation