Basics: Document & Batch Workflow

How is an uploaded document or batch processed in the Parashift Platform? What steps (Separation, Classification, Extraction) are there and what do they do? How are these workflow steps represented in our data model?

Every uploaded document and batch runs through the same workflow steps. Depending on the use case, some steps can be turned off, while others can be very extensive.

Workflow Description

The following short, high-level diagram explains the different workflow steps.

Inbound

Users usually don't interact with this workflow step. It is a purely technical step that prepares incoming files for further processing. Files are converted into an internal working format and enhanced to remove skewing, rotation or background from the image.

One of the most important tasks in this workflow step is Optical Character Recognition (OCR), which transforms any text on the images into machine-readable text.

Separation

If you only work with documents that are already properly separated (e.g. you only upload PDF files that each contain a single document), this step can be skipped.

In this step, a batch, which is a collection of multiple documents uploaded together, is separated into single documents for further processing. Separation could be done purely manually by a user, but ideally it is automated using different methods.

Separation can also be revisited at a later stage in the process, should a user see that a document was not properly separated and needs further splitting.

Classification

If you already know which document type you are going to process, this step can be skipped.

In the classification step, single documents are classified into different document types. Users can interact with exceptions that could not be classified automatically.

Extraction

For most of our users, this is the most important step in the whole process. Like separation and classification, it can be skipped if you only want properly separated and classified documents.

Depending on the document type and the configured fields, we extract data from the document. Users can interact with the document, capture fields manually, or validate fields that had low prediction confidence.

There are many configuration options, and many standard document types and fields are available out of the box, so you can extract the data points you need.

Data Model

The following three attributes always show the current status of a document and which of the workflow steps described above it is currently in.

| attribute | allowed values | description |
| --- | --- | --- |
| status | pending, in_progress, done, failed | overall document status, mostly either in_progress or done |
| workflow_step | inbound, inbound_processing, ocr, classification, classification_validation, extraction, extraction_validation, outbound_processing, done, qc | current workflow step |
| workflow_status | started, in_progress, retry, failed, done | current status of the current workflow_step |

Examples of the most important workflow & status combinations

| status | workflow_step | workflow_status | Description |
| --- | --- | --- | --- |
| done | done | done | The document is completely processed, and all data can be fetched, including export files, document type and field data. |
| in_progress | classification_validation or extraction_validation | in_progress | The document is waiting for manual interaction by a user. Data can already be fetched but may change with validation (user interaction). |
| failed | | | The processing has failed. Check the uploaded file for correctness or contact Parashift Support. |
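
To make the combinations above easier to reason about in client code, the three attributes can be modelled as one small value object. The following is only a sketch: the class name `DocumentState` and its properties are hypothetical and not part of the Parashift API. Only the allowed values are taken from the attribute table above.

```python
from dataclasses import dataclass

# Allowed values, copied from the attribute table in this article.
STATUSES = frozenset({"pending", "in_progress", "done", "failed"})
WORKFLOW_STEPS = frozenset({
    "inbound", "inbound_processing", "ocr", "classification",
    "classification_validation", "extraction", "extraction_validation",
    "outbound_processing", "done", "qc",
})
WORKFLOW_STATUSES = frozenset({"started", "in_progress", "retry", "failed", "done"})

# Steps in which a document waits for a user (see the examples table).
VALIDATION_STEPS = frozenset({"classification_validation", "extraction_validation"})


@dataclass(frozen=True)
class DocumentState:
    """Hypothetical helper bundling the three state attributes of a document."""

    status: str
    workflow_step: str
    workflow_status: str

    def __post_init__(self) -> None:
        # Reject values that are not listed in the attribute table.
        for value, allowed, name in (
            (self.status, STATUSES, "status"),
            (self.workflow_step, WORKFLOW_STEPS, "workflow_step"),
            (self.workflow_status, WORKFLOW_STATUSES, "workflow_status"),
        ):
            if value not in allowed:
                raise ValueError(f"unexpected {name}: {value!r}")

    @property
    def is_complete(self) -> bool:
        # Row 1 of the examples table: done / done / done.
        return (self.status, self.workflow_step, self.workflow_status) == (
            "done", "done", "done")

    @property
    def awaits_validation(self) -> bool:
        # Row 2: in_progress in one of the validation steps.
        return (self.status == "in_progress"
                and self.workflow_step in VALIDATION_STEPS
                and self.workflow_status == "in_progress")

    @property
    def has_failed(self) -> bool:
        # Row 3: only the overall status is specified for failures.
        return self.status == "failed"
```

For example, `DocumentState("in_progress", "extraction_validation", "in_progress").awaits_validation` evaluates to `True`, while an unknown step name raises a `ValueError`.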

Example API Calls

Filter for done documents

https://api.parashift.io/v2/documents?filter[status]=done

Filter for done documents that were not yet exported

https://api.parashift.io/v2/documents?filter[exported_at_blank]=true&filter[status]=done

The resulting list is then often processed further (e.g. to fetch the field data of each document)

https://api.parashift.io/v2/documents/123456/?include=document_fields&extra_fields[document_fields]=extraction_candidates

Each document is then marked as exported, which excludes it from the original query

https://api.parashift.io/v2/documents/123456/mark_as_exported
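
The export sequence above (list, fetch field data, mark as exported) could be sketched as a small loop. This is a rough illustration under several assumptions that the article does not confirm: that the list response carries a `data` array whose entries have an `id`, that `mark_as_exported` is called with `POST`, and that pagination can be ignored. The function `export_done_documents` and the `request` and `process` callables are hypothetical; `request` stands in for whatever HTTP client and authentication you actually use.

```python
from typing import Callable, List, Optional

API = "https://api.parashift.io/v2"

# Hypothetical transport: takes (method, url), returns the decoded JSON body.
Transport = Callable[[str, str], Optional[dict]]


def export_done_documents(request: Transport,
                          process: Callable[[dict], None]) -> List[str]:
    """Fetch done, not yet exported documents, process them, mark them exported."""
    listing = request(
        "GET",
        f"{API}/documents?filter[exported_at_blank]=true&filter[status]=done",
    )
    exported: List[str] = []
    # Assumption: documents are listed under "data", each with an "id".
    for doc in (listing or {}).get("data", []):
        doc_id = doc["id"]
        details = request(
            "GET",
            f"{API}/documents/{doc_id}/?include=document_fields"
            "&extra_fields[document_fields]=extraction_candidates",
        )
        process(details or {})
        # Assumption: marking as exported is a POST on this URL.
        request("POST", f"{API}/documents/{doc_id}/mark_as_exported")
        exported.append(doc_id)
    return exported
```

Marking a document only after `process` has returned means that a document whose processing raised an error stays in the "not yet exported" query and is picked up again on the next run.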

Count documents awaiting Extraction Validation

https://api.parashift.io/v2/documents?filter[workflow_step]=extraction_validation&filter[workflow_status]=in_progress&stats[total]=count    
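
The filter URLs in this section all follow the same pattern, so they can be assembled from a dictionary instead of being written by hand. The helper below is a minimal sketch using only the Python standard library. The function name `documents_url` is hypothetical, and the brackets are left unencoded purely so that the result matches the URLs shown in this article.

```python
from typing import Dict, Optional
from urllib.parse import urlencode

BASE_URL = "https://api.parashift.io/v2/documents"


def documents_url(filters: Optional[Dict[str, str]] = None,
                  extra: Optional[Dict[str, str]] = None) -> str:
    """Build a documents URL with filter[...] parameters plus extra parameters."""
    # Wrap each filter name as filter[name], keeping insertion order.
    params = {f"filter[{name}]": value for name, value in (filters or {}).items()}
    params.update(extra or {})
    # safe="[]" keeps the brackets readable instead of percent-encoding them.
    query = urlencode(params, safe="[]")
    return f"{BASE_URL}?{query}" if query else BASE_URL


done = documents_url({"status": "done"})
not_exported = documents_url({"exported_at_blank": "true", "status": "done"})
awaiting_validation_count = documents_url(
    {"workflow_step": "extraction_validation", "workflow_status": "in_progress"},
    {"stats[total]": "count"},
)
```

The three variables reproduce the "done", "done and not yet exported" and "count documents awaiting Extraction Validation" URLs from this section.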

Recommended Reading

I strongly recommend the following articles, which go into detail about how a batch is related to documents, pages and especially input_files, about the batch schema, and about the relation between documents and fields. They also explain how to upload documents so that workflow steps are skipped, and more.


Also, check out our Postman API Documentation.