Reputation: 11
While extracting values using Azure Form Recognizer, many values are shown duplicated.
I have trained a custom model, labelling the appropriate key values. I find that the OCR duplicates the boxes, so when I label with the sample labeling tool I often get one box inside another. I need to pick one and deselect the other to avoid the value being shown twice.
When I run the model on a new PDF, I also get duplicated values for many keys.
Furthermore, on inspecting the result JSON I can see that many Lines have nested or overlapping bounding boxes. Typically you would expect a Line to have a bounding box and associated text, and to contain "Words" whose bounding boxes lie inside the bounding box of the Line.
Just to clarify: in the JSON I am seeing Lines whose bounding boxes, and therefore text, overlap or are nested inside other Lines.
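To make the overlap concrete, here is a minimal sketch of how the nested Lines could be listed from one page of the result. It assumes each line is a dict with a `text` field and a `boundingBox` of eight numbers (x1, y1, ..., x4, y4), which is how I understand the v2.x output; the helper names, the sample page and the 0.8 threshold are made up for illustration.

```python
from itertools import combinations


def to_rect(bounding_box):
    """Convert an 8-number polygon [x1, y1, ..., x4, y4] to (left, top, right, bottom)."""
    xs = bounding_box[0::2]
    ys = bounding_box[1::2]
    return min(xs), min(ys), max(xs), max(ys)


def rect_area(rect):
    return max(0.0, rect[2] - rect[0]) * max(0.0, rect[3] - rect[1])


def overlap_ratio(a, b):
    """Intersection area divided by the area of the smaller rectangle (0.0 to 1.0)."""
    left, top = max(a[0], b[0]), max(a[1], b[1])
    right, bottom = min(a[2], b[2]), min(a[3], b[3])
    if right <= left or bottom <= top:
        return 0.0
    smaller = min(rect_area(a), rect_area(b))
    if smaller == 0:
        return 0.0
    return (right - left) * (bottom - top) / smaller


def find_overlapping_lines(page, threshold=0.8):
    """Return (text_a, text_b, ratio) for every pair of lines that mostly overlap.

    `page` is assumed to look like one entry of the result's page list,
    i.e. a dict with a "lines" list (an assumption, not confirmed here).
    """
    lines = page.get("lines", [])
    rects = [to_rect(line["boundingBox"]) for line in lines]
    hits = []
    for i, j in combinations(range(len(lines)), 2):
        ratio = overlap_ratio(rects[i], rects[j])
        if ratio >= threshold:
            hits.append((lines[i]["text"], lines[j]["text"], ratio))
    return hits


# Hypothetical page: the second line sits entirely inside the first one.
sample_page = {
    "lines": [
        {"text": "Invoice 123", "boundingBox": [1, 1, 5, 1, 5, 2, 1, 2]},
        {"text": "123", "boundingBox": [3, 1, 4, 1, 4, 2, 3, 2]},
        {"text": "Total", "boundingBox": [1, 3, 2, 3, 2, 4, 1, 4]},
    ]
}

for text_a, text_b, ratio in find_overlapping_lines(sample_page):
    print(f"{text_a!r} overlaps {text_b!r} (ratio {ratio:.2f})")
```

The ratio is taken against the smaller box so that a small Line fully contained in a larger one scores 1.0. The same pairs could be used to drop the inner Line before reading the values, though that is a workaround and does not explain why the OCR emits them.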
Any clues as to why this can be?
Upvotes: 1
Views: 422
Reputation: 312
Could you share a sample of the PDF file you used? The problem doesn't occur when you use the sample PDF documents, right? The sample data file can be found here: https://github.com/Azure-Samples/cognitive-services-REST-api-samples/blob/master/curl/form-recognizer/sample_data.zip
Upvotes: 0