Reputation: 11
I am trying to include the page number in the configuration JSON. I tried several approaches, but none of them worked. Looking at the converter code on the GitHub page, I saw many mentions of "page_number", so I think it should be possible.
Also, is there any documentation on how to create the config files? I only found some examples on the GitHub page.
To make the config file easier to write, I created an intermediary JSON file from the original JSON that I want to convert. The JSON that I want to convert looks like this (I can rearrange the fields if needed):
{
  "document": "document_name.pdf",
  "labels": [
    {
      "label": "label_1",
      "text": "some_text",
      "page": 1,
      "boundingBoxes": [
        0.11554117647058823,
        0.17014227068750512,
        0.32081176470588235,
        0.1710069997542349,
        0.32081176470588235,
        0.1818570739388864,
        0.11554117647058823,
        0.18099234487215662
      ]
    },
    {
      "label": "label_2",
      "text": "some_text",
      "page": 2,
      "boundingBoxes": [
        0.269359477124183,
        0.5985227851890204,
        0.386081045751634,
        0.5980899758112141,
        0.386081045751634,
        0.6102450429168925,
        0.269359477124183,
        0.6098081885916301
      ]
    }
  ]
}
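To make the layout explicit: the flat `boundingBoxes` array appears to hold four corners as alternating x and y values in normalized coordinates, and `page` looks 1-based. Below is a minimal Python sketch of reading the file under those assumptions. The helper names `to_vertices` and `labels_by_page` are only illustrative, and the coordinates are shortened.

```python
from collections import defaultdict


def to_vertices(flat):
    """Pair a flat [x1, y1, x2, y2, ...] list into (x, y) tuples."""
    if len(flat) % 2 != 0:
        raise ValueError("expected an even number of coordinates")
    return list(zip(flat[0::2], flat[1::2]))


def labels_by_page(source):
    """Group labels by their page number (assumed to be 1-based)."""
    pages = defaultdict(list)
    for label in source["labels"]:
        pages[label["page"]].append(
            {
                "type": label["label"],
                "text": label["text"],
                "vertices": to_vertices(label["boundingBoxes"]),
            }
        )
    return dict(pages)


# Shortened version of the intermediary JSON shown above.
source = {
    "document": "document_name.pdf",
    "labels": [
        {"label": "label_1", "text": "some_text", "page": 1,
         "boundingBoxes": [0.1, 0.2, 0.3, 0.2, 0.3, 0.4, 0.1, 0.4]},
        {"label": "label_2", "text": "some_text", "page": 2,
         "boundingBoxes": [0.5, 0.6, 0.7, 0.6, 0.7, 0.8, 0.5, 0.8]},
    ],
}

grouped = labels_by_page(source)
print(sorted(grouped))  # [1, 2]
```

Grouping by page this way makes it easy to check that each label carries the page it is expected to land on after conversion.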
I tried the config files below, but with every one of them, the bounding boxes for page 2 and later were drawn on the first page.
{
  "entity_object": "labels",
  "page": {
    "pageNumber": "page"
  },
  "entity": {
    "type_": "label",
    "mention_text": "text",
    "normalized_vertices": {
      "type": "3",
      "unit": "normalized",
      "base": "boundingBoxes",
      "x": "x",
      "y": "y"
    }
  }
}

{
  "entity_object": "labels",
  "page": {
    "page_number": "page"
  },
  "entity": {
    "type_": "label",
    "mention_text": "text",
    "normalized_vertices": {
      "type": "3",
      "unit": "normalized",
      "base": "boundingBoxes",
      "x": "x",
      "y": "y"
    }
  }
}

{
  "entity_object": "labels",
  "entity": {
    "type_": "label",
    "mention_text": "text",
    "pageNumber": "page",
    "normalized_vertices": {
      "type": "3",
      "unit": "normalized",
      "base": "boundingBoxes",
      "x": "x",
      "y": "y"
    }
  }
}

{
  "entity_object": "labels",
  "entity": {
    "type_": "label",
    "mention_text": "text",
    "page_number": "page",
    "normalized_vertices": {
      "type": "3",
      "unit": "normalized",
      "base": "boundingBoxes",
      "x": "x",
      "y": "y"
    }
  }
}
Upvotes: 0
Views: 259
Reputation: 5906
The problem is that the converter ignores page numbers. The only remaining reference to them is the one inside pageAnchor, which defaults to zero, so every entity is placed on the first page.
To fix this, you can modify the converter code inside the _get_entity_content function so that it looks like this:
docai_entity.page_anchor = documentai.Document.PageAnchor(
    page_refs=[
        documentai.Document.PageAnchor.PageRef(
            bounding_poly=bounding_box, page=page_number
        )
    ]
)
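One thing to double-check with this change: as far as I know, `PageRef.page` is a zero-based index into `Document.pages`, so a 1-based page number from the input would need to be decremented first. Here is a minimal sketch of the resulting structure, using plain dicts in the Document JSON shape rather than the `documentai` classes. The helper name `build_page_anchor` and the 1-based input are assumptions, not part of the converter.

```python
def build_page_anchor(page_number, flat_box):
    """Return a pageAnchor dict for a page number assumed to start at 1."""
    if page_number < 1:
        raise ValueError("page_number is assumed to start at 1")
    # Pair the flat [x1, y1, x2, y2, ...] list into vertex dicts.
    vertices = [
        {"x": x, "y": y} for x, y in zip(flat_box[0::2], flat_box[1::2])
    ]
    return {
        "pageRefs": [
            {
                # Zero-based index into Document.pages (assumption noted above).
                "page": page_number - 1,
                "boundingPoly": {"normalizedVertices": vertices},
            }
        ]
    }


anchor = build_page_anchor(2, [0.27, 0.60, 0.39, 0.60, 0.39, 0.61, 0.27, 0.61])
print(anchor["pageRefs"][0]["page"])  # 1
```

If the `page_number` available inside `_get_entity_content` is already zero-based, the subtraction should of course be dropped.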
Upvotes: 0
Reputation: 2234
I don't think the converter currently reads page numbers from the configuration JSON or the input files, so with the current design all of the entities will show up on the first page.
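Until that is supported, a possible workaround is to patch the converted output afterwards. Below is a rough sketch that works on the Document JSON as a plain dict. It assumes the entities come out in the same order as the input labels (not verified against the converter) and that input pages are 1-based. `patch_entity_pages` is a hypothetical helper, not part of the toolbox.

```python
import copy


def patch_entity_pages(document, labels):
    """Return a copy of `document` with each entity's page set from `labels`.

    Assumes document["entities"][i] corresponds to labels[i].
    """
    entities = document.get("entities", [])
    if len(entities) != len(labels):
        raise ValueError("entity/label counts differ; cannot match by order")
    patched = copy.deepcopy(document)
    for entity, label in zip(patched.get("entities", []), labels):
        anchor = entity.setdefault("pageAnchor", {})
        refs = anchor.setdefault("pageRefs", [])
        if not refs:
            refs.append({})
        for ref in refs:
            # Convert the assumed 1-based input page to a zero-based index.
            ref["page"] = label["page"] - 1
    return patched


converted = {
    "entities": [
        {"type": "label_1", "pageAnchor": {"pageRefs": [{"page": 0}]}},
        {"type": "label_2", "pageAnchor": {"pageRefs": [{"page": 0}]}},
    ]
}
labels = [{"label": "label_1", "page": 1}, {"label": "label_2", "page": 2}]

fixed = patch_entity_pages(converted, labels)
print([e["pageAnchor"]["pageRefs"][0]["page"] for e in fixed["entities"]])  # [0, 1]
```

Matching by order is fragile. If the converter reorders or drops entities, matching on label type and text would be safer.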
I've been working on refactoring the converter tool to support more use cases and structures, but it has been challenging to find the time.
There are sample config.json files in this directory that you can use as examples:
https://github.com/googleapis/python-documentai-toolbox/tree/main/samples/sample-converter-configs
There is also some information in the sample code here:
https://cloud.google.com/document-ai/docs/toolbox#third-party-conversion
Upvotes: 0