Augusto Firmo

Reputation: 11

How to use the converter from GCP Document AI

I am trying to use the Document AI converter to convert some JSON files to the Document AI JSON format, using the function described in this documentation:

https://cloud.google.com/document-ai/docs/samples/documentai-toolbox-convert-external-annotations

I used the example config.json from the GitHub page:

https://github.com/googleapis/python-documentai-toolbox/blob/d29ff95742269a95e1e96e047f0fa1268457292a/samples/sample-converter-configs/Azure/invoice-config.json

I also used the JSON annotations from Form Recognizer and the PDF in the attachment. The only change I made to the config was replacing "pxl" with "inch", because that is the unit used in the JSON annotations.

For processor_id, I tried a new Invoice Parser processor and another one that has some models that I fine-tuned. I also tried the processor version IDs of some trained models, as well as pretrained-invoice-v1.3-2022-07-15.

The input bucket path has the format gs://convertion_input_test/azure_test/, and I put three files in the azure_test folder (sample-invoice.pdf, sample-invoice_annotations.json and sample-invoice_config.json). The output bucket path is gs://convertion_output_test/azure_test/.

When I run the convert_external_annotations_sample() function, I receive this output in all cases:

-------- Downloading Started --------
-------- Finished Downloading --------
-------- Converting Started --------
-------- Finished Converting --------
-------- Uploading Started --------
-------- Finished Uploading --------
-------- Schema Information --------
Unique Entity Types: []

And nothing is saved in the output bucket.

Did I configure something incorrectly? I checked, and the JSON annotations have all the fields used in the config JSON. Do I need to change something in this file?

The PDF file looks like this:

sample-invoice.pdf

The sample-invoice_config.json:

{
    "entity_object":"analyzeResult.documentResults.0.fields",
    "page": {
        "height":"analyzeResult.readResults.0.height",
        "width":"analyzeResult.readResults.0.width"
    },
    "entity": {
        "type_":"analyzeResult.documentResults.0.fields:self",
        "mention_text":"text",
        "normalized_vertices":{
            "type":"3",
            "unit":"inch",
            "base":"boundingBox",
            "x":"x",
            "y":"y"
        }
    }
}

And a part of the sample-invoice_annotations.json:

{
    "status": "succeeded",
    "createdDateTime": "2020-11-06T23:32:11Z",
    "lastUpdatedDateTime": "2020-11-06T23:32:20Z",
    "analyzeResult": {
        "version": "2.1.0",
        "readResults": [{
            "page": 1,
            "angle": 0,
            "width": 8.5,
            "height": 11,
            "unit": "inch"
        }],
        "pageResults": [{
            "page": 1,
            "tables": [{
                "rows": 3,
                "columns": 4,
                "cells": [{
                    "rowIndex": 0,
                    "columnIndex": 0,
                    "text": "QUANTITY",
                    "boundingBox": [0.4953,
                    5.7306,
                    1.8097,
                    5.7306,
                    1.7942,
                    6.0122,
                    0.4953,
                    6.0122]
                },
                {
                    "rowIndex": 0,
                    "columnIndex": 1,
                    "text": "DESCRIPTION",
                    "boundingBox": [1.8097,
                    5.7306,
                    5.7529,
                    5.7306,
                    5.7452,
                    6.0122,
                    1.7942,
                    6.0122]
                },
                {
                    "rowIndex": 0,
                    "columnIndex": 2,
                    "text": "UNIT PRICE",
                    "boundingBox": [5.7529,
                    5.7306,
                    6.8045,
                    5.7306,
                    6.8122,
                    6.0122,
                    5.7452,
                    6.0122]
                },

......
......
......

                "VendorName": {
                    "type": "string",
                    "valueString": "CONTOSO LTD.",
                    "text": "CONTOSO LTD.",
                    "boundingBox": [0.5909,
                    0.6827,
                    2.3215,
                    0.6827,
                    2.3215,
                    0.8644,
                    0.5909,
                    0.8644],
                    "page": 1,
                    "confidence": 0.998
                }
            }
        }]
    }
}
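As a side note on the "normalized_vertices" block in the config: normalized vertices are coordinates expressed as fractions of the page size. The sketch below only illustrates that arithmetic using the VendorName box and the 8.5 x 11 inch page from the excerpt above. It assumes the boundingBox is a flat list of x, y pairs; the helper name normalize_bounding_box is hypothetical, and this is not the converter's own code.

```python
def normalize_bounding_box(bounding_box, page_width, page_height):
    """Turn a flat [x1, y1, x2, y2, ...] box into vertices scaled to 0..1.

    Assumption: values alternate x, y and use the same unit as the page size.
    """
    if len(bounding_box) % 2 != 0:
        raise ValueError("bounding box must contain an even number of values")
    xs = bounding_box[0::2]  # every even index is an x coordinate
    ys = bounding_box[1::2]  # every odd index is a y coordinate
    return [
        {"x": x / page_width, "y": y / page_height}
        for x, y in zip(xs, ys)
    ]


# Values copied from the VendorName entry and readResults in the excerpt.
vendor_box = [0.5909, 0.6827, 2.3215, 0.6827, 2.3215, 0.8644, 0.5909, 0.8644]
vertices = normalize_bounding_box(vendor_box, page_width=8.5, page_height=11)

for vertex in vertices:
    print(round(vertex["x"], 4), round(vertex["y"], 4))
```

If the unit in the config does not match the unit of the annotations, the resulting fractions would fall outside the 0 to 1 range, which is why the unit was changed to "inch" here.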

Upvotes: 0

Views: 594

Answers (1)

Holt Skinner

Reputation: 2234

First of all, thank you for using the Document AI Toolbox SDK and providing feedback. I do want to note that it is in the development stage and there may be backwards-incompatible changes made before the library's 1.0.0 release.

A couple of items I noticed:

  1. You don't actually need to use the Document AI Invoice Parser for the converter tool. You just need to use the Document OCR processor to get the OCR data.

    • The converter should still work as expected with the Invoice Parser, but if you're able to get the expected entities by sending the original documents to that processor, you can likely go without using the converter tool and just save the outputs from the processor to use in training.
    • The main purpose for this converter tool is to convert external annotation formats into the Document JSON format for importing into Document AI Workbench to train custom processors without having to re-label manually.
  2. It's difficult to tell without the full sample-invoice_annotations.json file, but it looks like entities such as "VendorName" are nested under pageResults instead of under documentResults, which is where sample-invoice_config.json expects them. Could you provide a link to the full sample-invoice_annotations.json file and the original PDF file you were using? (The PNG appears to be compressed and may not behave the same way.)
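One way to check the second point locally is to walk the dotted paths from the config through the annotations JSON and see where they stop resolving. The following is a minimal sketch in plain Python, not the Toolbox's own lookup logic: the helper resolve_path is hypothetical, it assumes numeric segments such as "0" are list indices, and it does not handle the ":self" suffix used in "type_". The annotations dictionary is a trimmed stand-in that deliberately has no documentResults key.

```python
def resolve_path(data, dotted_path):
    """Walk a dotted path such as 'a.b.0.c' through nested dicts and lists.

    Returns (True, value) if every segment resolves,
    otherwise (False, first_segment_that_failed).
    """
    current = data
    for segment in dotted_path.split("."):
        if isinstance(current, dict) and segment in current:
            current = current[segment]
        elif (
            isinstance(current, list)
            and segment.isdigit()
            and int(segment) < len(current)
        ):
            current = current[int(segment)]
        else:
            return False, segment
    return True, current


# Trimmed stand-in for sample-invoice_annotations.json (assumption: the real
# file may or may not contain analyzeResult.documentResults).
annotations = {
    "analyzeResult": {
        "readResults": [{"page": 1, "width": 8.5, "height": 11, "unit": "inch"}],
        "pageResults": [{"page": 1, "tables": []}],
    }
}

# Paths copied from sample-invoice_config.json in the question.
config = {
    "entity_object": "analyzeResult.documentResults.0.fields",
    "page": {
        "height": "analyzeResult.readResults.0.height",
        "width": "analyzeResult.readResults.0.width",
    },
}

checks = [
    ("entity_object", config["entity_object"]),
    ("page.height", config["page"]["height"]),
    ("page.width", config["page"]["width"]),
]
for label, path in checks:
    found, detail = resolve_path(annotations, path)
    status = "ok" if found else f"missing at '{detail}'"
    print(f"{label}: {path} -> {status}")
```

With the real files, the two dictionaries would be loaded with json.load instead. If entity_object fails to resolve, that would be consistent with the empty "Unique Entity Types: []" output, though the actual cause can only be confirmed against the full annotations file.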

Upvotes: 1
