Relationships & Structure of Batches, Documents, Pages and Files

This article explains how batches, documents, pages and files relate to each other: how many pages a document contains, how many documents a batch contains and in which order, which incoming files led to which batch, document and page, and how to download output files.

Introduction

The Parashift Platform can handle all sorts of different scenarios, from uploading batches with multiple documents consisting of multiple files each to separating batches into documents and then rotating, sorting and deleting pages inside of these documents.

An output file can be created for each document. Internally, the platform also creates various other files, such as a colored JPEG for display in our viewers and a thumbnail for lists (plus other non-public file types used for our machine learning predictions 😉).

This article describes how to trace all of these different changes and relationships between the different objects.

Relationships & Structure

In a nutshell, the following diagram shows how the different objects (batches, documents, pages and files) are linked with each other. 

[Diagram: Relationships & Structure of Batches, Documents, Pages and Files (2022-03-09)]

Main takeaways

  • One batch consists of one or multiple documents, while a document is always linked to one single batch.
  • One document consists of one or multiple pages, while a page is always linked to one single document.
  • One file (input_file, output_file, color_jpeg) is linked to one single record (batch, document, page), while one record can have multiple files of the same or different type.
    A generalized "Files" object stores all the different types of files, whether uploaded or generated by the platform. Each file is linked to one main record, identified by its record_type (batches, documents or pages) and record_id (the batch_id, document_id or page_id):

    file_type     record_type   record_id
    input_file    batches       <batch_id>
    output_file   documents     <document_id>
    color_jpeg    pages         <page_id>
  • An "index" or "<object>_index" field (e.g. batch_index, document_index) is used throughout the different objects to show the order/rank of one object inside another. The value is zero-based.
    • documents.batch_index -> order of this document inside the corresponding batch
    • pages.document_index -> order of this page inside the corresponding document
    • pages.input_file_index -> the zero-based page number within the originally uploaded input_file from which this page was created (e.g. for a 20-page PDF uploaded as input_file, pages.input_file_index = "12" means this document page was created from the 13th page of that input_file)
    • files.index -> order of this file in the linked record, depending on file_type & record_type:

      file_type     record_type   index
      input_file    batches       order of this input_file inside the corresponding batch
      output_file   documents     order of this output_file inside the corresponding document; fixed to "0" since one document (at the moment) has only one output_file
      color_jpeg    pages         order of this color_jpeg inside the corresponding page; fixed to "0" since one page has only one color_jpeg
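
The relationships and zero-based indexes described above can be sketched in a few lines of Python. This is only an illustrative local model, not the platform's implementation: the class names Page and FileRecord and the helper functions are hypothetical, and only the field names (document_index, input_file_index, file_type, record_type, record_id, index) are taken from this article.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Page:
    # Hypothetical local representation of a page record.
    id: int
    document_id: int
    document_index: int    # zero-based position inside the document
    input_file_index: int  # zero-based page number in the uploaded input_file


@dataclass
class FileRecord:
    # Hypothetical local representation of an entry in the "Files" object.
    id: int
    file_type: str    # e.g. "input_file", "output_file", "color_jpeg"
    record_type: str  # "batches", "documents" or "pages"
    record_id: int    # id of the linked batch, document or page
    index: int        # zero-based order inside the linked record


def pages_in_order(pages: List[Page], document_id: int) -> List[Page]:
    """Return the pages of one document sorted by document_index."""
    own = [p for p in pages if p.document_id == document_id]
    return sorted(own, key=lambda p: p.document_index)


def source_page_number(page: Page) -> int:
    """Convert the zero-based input_file_index to a 1-based page number."""
    return page.input_file_index + 1


def files_for(files: List[FileRecord], record_type: str, record_id: int,
              file_type: Optional[str] = None) -> List[FileRecord]:
    """Return the files linked to one record, ordered by files.index."""
    linked = [
        f for f in files
        if f.record_type == record_type
        and f.record_id == record_id
        and (file_type is None or f.file_type == file_type)
    ]
    return sorted(linked, key=lambda f: f.index)
```

For example, a page with input_file_index = 12 maps to source page 13, matching the example given for pages.input_file_index.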

Example API Calls

Ordered list and total count of pages in a document

GET /pages/?filter[document_id]=123456&stats[total]=count&sort=document_index
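
As a rough sketch, the query string for calls like this one can be assembled with Python's standard library. The base URL below is a placeholder assumption and must be replaced with the actual API host and version prefix. Authentication is deliberately left out. The helper name build_url is hypothetical.

```python
from urllib.parse import urlencode

# Placeholder assumption: replace with the real API host and version prefix.
BASE_URL = "https://api.example.com/v1"


def build_url(path: str, params: dict) -> str:
    """Build a request URL, keeping the square brackets in parameter names readable."""
    # safe="[]" prevents the brackets from being percent-encoded.
    query = urlencode(params, safe="[]")
    return f"{BASE_URL}{path}?{query}"


url = build_url(
    "/pages/",
    {
        "filter[document_id]": 123456,  # only pages of this document
        "stats[total]": "count",        # ask for the total number of pages
        "sort": "document_index",       # order pages as they appear in the document
    },
)
print(url)
```

The same helper can be reused for the other calls in this section by changing the path and parameters.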

Pages in a batch, corresponding input_file per page

GET /batches/123456/?include=input_files&extra_fields[files]=url

The output_file (created PDF) of a document

GET /documents/123456/?include=output_files&extra_fields[files]=url

(the url extra field is needed to get a temporarily accessible download URL)

All output_files (created PDF) of a batch

GET /documents/?filter[batch_id]=123456&include=output_files&extra_fields[files]=url

(the url extra field is needed to get a temporarily accessible download URL)
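
Once such a response has been received, the download URLs can be collected from the included files. The sketch below assumes a JSON:API-style compound document in which included resources carry "type", "id" and "attributes" keys, and in which file attributes contain "file_type" and "url". This layout is an assumption inferred from the include and extra_fields parameters and should be checked against an actual response. The function name output_file_urls is hypothetical.

```python
def output_file_urls(response: dict) -> list:
    """Collect the temporary download URLs of all included output_files.

    Assumes a JSON:API-style layout (an assumption, not confirmed here):
    {"data": [...], "included": [{"type": "files", "attributes": {...}}]}
    """
    urls = []
    for item in response.get("included", []):
        attrs = item.get("attributes", {})
        is_output = item.get("type") == "files" and attrs.get("file_type") == "output_file"
        if is_output and attrs.get("url"):
            urls.append(attrs["url"])
    return urls


# Made-up sample payload, used purely to illustrate the assumed shape.
sample = {
    "data": [{"type": "documents", "id": "1"}],
    "included": [
        {"type": "files", "id": "7",
         "attributes": {"file_type": "output_file", "url": "https://example.com/a.pdf"}},
        {"type": "files", "id": "8",
         "attributes": {"file_type": "input_file", "url": "https://example.com/b.pdf"}},
    ],
}
print(output_file_urls(sample))
```

Since the URLs are only temporarily accessible, it is safer to download the files right after the request than to store the URLs for later use.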

All color_jpeg pages of a given document

GET /pages/?filter[document_id]=123456&include=color_jpeg

All color_jpeg files of a given document

GET /pages/?filter[document_id]=123456&include=color_jpeg&extra_fields[files]=url

(the url extra field is needed to get a temporarily accessible download URL)

All input_files of a given batch

GET /files/?filter[record_id]=&filter[file_type]=input_file&extra_fields[files]=url

(the url extra field is needed to get a temporarily accessible download URL)