Jerry Kaiser

Reputation: 33

Labelling multi-page documents with Google Doc AI Workbench

I'm trying to label multi-page documents to train a custom processor for Google Doc AI. The one issue I cannot figure out is how to handle a single field, say a detailed description of an item, that crosses two pages: half of it is on page 2 and the other half is on page 3.

I cannot find any information about how to handle this situation, so I'm hoping someone else has done something similar!

In the Google Workbench, each label you apply seems to be tied to a specific page, and I can find no way to tell it that a field's data is sometimes split between pages.

This happens quite often with these documents. Thankfully, the data is otherwise well organized apart from these description fields (which are usually only 20 words, but could be 100 or more). I think that makes a custom processor a potentially good solution, but I don't know how to 'explain' this part to it.

Upvotes: 3

Views: 827

Answers (1)

Holt Skinner

Reputation: 2234

Currently, the recommended method for handling multi-line fields (including fields spread across multiple pages) is to have a separate label/entity for each line, then concatenate the entities in post-processing.

Example: Create separate labels DESCRIPTION_LINE1, DESCRIPTION_LINE2, etc., and label each line of the field in every document with its own label. Then, in post-processing, you can concatenate DESCRIPTION_LINE1 + DESCRIPTION_LINE2 + DESCRIPTION_LINE3... to store the full value after the document has been processed by the Custom Document Extractor.
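A minimal sketch of that concatenation step in Python, for illustration only. It assumes each extracted entity exposes its label as `type_` and its text as `mention_text`; the `Entity` class below is a hypothetical stand-in so the example runs on its own, and `join_line_entities` is an illustrative helper name, not part of any Google library.

```python
import re
from dataclasses import dataclass


@dataclass
class Entity:
    """Hypothetical stand-in for an extracted entity (label + text)."""
    type_: str
    mention_text: str


# Matches labels such as DESCRIPTION_LINE1, DESCRIPTION_LINE2, ...
_LINE_LABEL = re.compile(r"^(?P<field>.+)_LINE(?P<index>\d+)$")


def join_line_entities(entities, separator=" "):
    """Group *_LINE<n> entities by field name and join them in line order."""
    parts = {}
    for entity in entities:
        match = _LINE_LABEL.match(entity.type_)
        if match is None:
            continue  # not a per-line label, leave it to other handling
        index = int(match.group("index"))
        text = entity.mention_text.strip()
        parts.setdefault(match.group("field"), []).append((index, text))
    return {
        field: separator.join(
            text for _, text in sorted(lines, key=lambda pair: pair[0])
        )
        for field, lines in parts.items()
    }


# Usage: the two halves may come back in any order and from different pages.
entities = [
    Entity("DESCRIPTION_LINE2", "continues on page 3."),
    Entity("DESCRIPTION_LINE1", "Item text that "),
    Entity("TOTAL", "42"),
]
result = join_line_entities(entities)
```

Sorting on the numeric suffix rather than on arrival order means the result does not depend on the order in which the entities are returned.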

Upvotes: -1
