Reputation: 11
I am trying to use the Document AI Toolbox converter to convert some JSON files to the Document AI JSON format, using the function described in this documentation:
https://cloud.google.com/document-ai/docs/samples/documentai-toolbox-convert-external-annotations
I used the example config.json from the GitHub page, together with the JSON annotations from Form Recognizer and the PDF in the attachment. The only change I made to the config was replacing "pxl" with "inch", because that is the unit used in the JSON annotations.
For processor_id, I tried a new Invoice Parser processor and another one that has some models I fine-tuned. I also tried the processor version IDs of some trained models, as well as pretrained-invoice-v1.3-2022-07-15.
The input bucket path is gs://convertion_input_test/azure_test/, and I put three files in the azure_test folder: sample-invoice.pdf, sample-invoice_annotations.json, and sample-invoice_config.json. The output bucket path is gs://convertion_output_test/azure_test/.
When I run the convert_external_annotations_sample() function, I receive this output in all cases:
-------- Downloading Started --------
-------- Finished Downloading --------
-------- Converting Started --------
-------- Finished Converting --------
-------- Uploading Started --------
-------- Finished Uploading --------
-------- Schema Information --------
Unique Entity Types: []
And nothing is saved in the output bucket.
Did I configure something incorrectly? I checked, and the JSON annotations contain all the fields used in the config JSON. Do I need to change something in this file?
The PDF file looks like this:
The sample-invoice_config.json:
{
  "entity_object": "analyzeResult.documentResults.0.fields",
  "page": {
    "height": "analyzeResult.readResults.0.height",
    "width": "analyzeResult.readResults.0.width"
  },
  "entity": {
    "type_": "analyzeResult.documentResults.0.fields:self",
    "mention_text": "text",
    "normalized_vertices": {
      "type": "3",
      "unit": "inch",
      "base": "boundingBox",
      "x": "x",
      "y": "y"
    }
  }
}
And part of sample-invoice_annotations.json:
{
"status": "succeeded",
"createdDateTime": "2020-11-06T23:32:11Z",
"lastUpdatedDateTime": "2020-11-06T23:32:20Z",
"analyzeResult": {
"version": "2.1.0",
"readResults": [{
"page": 1,
"angle": 0,
"width": 8.5,
"height": 11,
"unit": "inch"
}],
"pageResults": [{
"page": 1,
"tables": [{
"rows": 3,
"columns": 4,
"cells": [{
"rowIndex": 0,
"columnIndex": 0,
"text": "QUANTITY",
"boundingBox": [0.4953,
5.7306,
1.8097,
5.7306,
1.7942,
6.0122,
0.4953,
6.0122]
},
{
"rowIndex": 0,
"columnIndex": 1,
"text": "DESCRIPTION",
"boundingBox": [1.8097,
5.7306,
5.7529,
5.7306,
5.7452,
6.0122,
1.7942,
6.0122]
},
{
"rowIndex": 0,
"columnIndex": 2,
"text": "UNIT PRICE",
"boundingBox": [5.7529,
5.7306,
6.8045,
5.7306,
6.8122,
6.0122,
5.7452,
6.0122]
},
......
......
......
"VendorName": {
"type": "string",
"valueString": "CONTOSO LTD.",
"text": "CONTOSO LTD.",
"boundingBox": [0.5909,
0.6827,
2.3215,
0.6827,
2.3215,
0.8644,
0.5909,
0.8644],
"page": 1,
"confidence": 0.998
}
}
}]
}
}
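To make the "inch" unit concrete: the boundingBox values above appear to be flat x, y pairs measured in inches, while Document AI normalized vertices are fractions of the page size between 0 and 1. The following is only a minimal sketch of that relationship, assuming normalization is a plain division by the page width and height from readResults. The function name normalize_bounding_box is hypothetical and is not part of the toolbox.

```python
from typing import List, Tuple


def normalize_bounding_box(
    bounding_box: List[float], page_width: float, page_height: float
) -> List[Tuple[float, float]]:
    """Turn a flat [x1, y1, x2, y2, ...] list into (x, y) pairs scaled to 0..1.

    Hypothetical helper for illustration; assumes the box and the page
    dimensions use the same unit (inches here).
    """
    if len(bounding_box) % 2 != 0:
        raise ValueError("bounding_box must contain x, y pairs")
    if page_width <= 0 or page_height <= 0:
        raise ValueError("page dimensions must be positive")
    xs = bounding_box[0::2]  # every even index is an x coordinate
    ys = bounding_box[1::2]  # every odd index is a y coordinate
    return [(x / page_width, y / page_height) for x, y in zip(xs, ys)]


# Values copied from the "VendorName" field and readResults above (inches).
vendor_name_box = [0.5909, 0.6827, 2.3215, 0.6827, 2.3215, 0.8644, 0.5909, 0.8644]
vertices = normalize_bounding_box(vendor_name_box, page_width=8.5, page_height=11)
for x, y in vertices:
    print(f"x={x:.4f}, y={y:.4f}")
```

If the unit in the config did not match the unit of the annotations, the resulting values would fall outside the 0 to 1 range, which is why I changed "pxl" to "inch".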
Upvotes: 0
Views: 594
Reputation: 2234
First of all, thank you for using the Document AI Toolbox SDK and providing feedback. Note that it is still in the development stage, and backwards-incompatible changes may be made before the library's 1.0.0 release.
A couple of items I noticed:
You don't actually need to use the Document AI Invoice Parser for the converter tool. You just need to use the Document OCR processor to get the OCR data.
It's difficult to tell without the full sample-invoice_annotations.json file, but it looks like entities such as "VendorName" are nested under pageResults instead of documentResults, which is the path that sample-invoice_config.json expects. Could you provide a link to the full sample-invoice_annotations.json file and the original PDF file you were using? (The PNG appears to be compressed and may not behave the same way.)
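In the meantime, one way to check the second point yourself is to walk each dotted path from the config through the annotations file and see where it stops resolving. This is a rough sketch, not how the toolbox resolves paths internally. The helpers resolve_path and path_exists are hypothetical, and the sample dictionary is a trimmed stand-in that assumes the fields really are missing from documentResults.

```python
from typing import Any


def resolve_path(data: Any, dotted_path: str) -> Any:
    """Follow a dotted path such as "a.b.0.c"; numeric segments index lists."""
    current = data
    walked = []
    for segment in dotted_path.split("."):
        walked.append(segment)
        try:
            if isinstance(current, list):
                current = current[int(segment)]
            else:
                current = current[segment]
        except (KeyError, IndexError, ValueError, TypeError) as exc:
            # Report the first segment that could not be resolved.
            raise KeyError(f"path breaks at {'.'.join(walked)!r}") from exc
    return current


def path_exists(data: Any, dotted_path: str) -> bool:
    """Return True if the whole dotted path resolves, False otherwise."""
    try:
        resolve_path(data, dotted_path)
    except KeyError:
        return False
    return True


# Trimmed, hypothetical stand-in for sample-invoice_annotations.json.
# In practice, load the real file with json.load() instead.
sample = {
    "analyzeResult": {
        "readResults": [{"page": 1, "width": 8.5, "height": 11, "unit": "inch"}],
        "pageResults": [{"page": 1, "tables": []}],
    }
}

for path in (
    "analyzeResult.readResults.0.width",
    "analyzeResult.documentResults.0.fields",
):
    print(path, "->", "found" if path_exists(sample, path) else "missing")
```

If the "entity_object" path from the config comes back as missing on your real file, that would be consistent with the empty "Unique Entity Types: []" output.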
Upvotes: 1