aaugsp

Reputation: 11

Azure Form Recognizer duplicating text extracted from PDF

When I extract values with Azure Form Recognizer, many of them come back duplicated.

I have trained a custom model, labelling the appropriate key values. I find that the OCR duplicates the boxes, so when I label documents in the sample labeling tool I often get one box inside another. I have to pick one and deselect the other to avoid the value showing up twice.

When I run the model on a new PDF, I also get duplicated values for many keys.

Furthermore, on inspecting the result JSON I can see that many lines have nested or overlapping bounding boxes. Normally you would expect a line to have a bounding box and associated text, and to contain "words" whose bounding boxes lie inside the bounding box of the line.

Just to clarify: in the JSON I am seeing lines (not only words) whose bounding boxes, and therefore text, overlap or are nested inside one another.
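To make the symptom concrete, here is a minimal sketch of how such overlapping lines could be flagged. It assumes each line is a dict with a "text" string and a "boundingBox" list of eight numbers (x1, y1, ..., x4, y4), which I believe is the shape found under analyzeResult.readResults[i].lines in the v2.x result JSON. The function names, the 0.8 threshold, and the sample data are made up for illustration and are not part of any SDK.

```python
from itertools import combinations


def to_rect(bounding_box):
    """Convert an 8-number polygon [x1, y1, ..., x4, y4] to (left, top, right, bottom)."""
    xs = bounding_box[0::2]
    ys = bounding_box[1::2]
    return min(xs), min(ys), max(xs), max(ys)


def overlap_ratio(a, b):
    """Intersection area divided by the area of the smaller rectangle (0.0 to 1.0)."""
    left, top = max(a[0], b[0]), max(a[1], b[1])
    right, bottom = min(a[2], b[2]), min(a[3], b[3])
    if right <= left or bottom <= top:
        return 0.0  # the rectangles do not intersect
    intersection = (right - left) * (bottom - top)
    smaller = min((a[2] - a[0]) * (a[3] - a[1]), (b[2] - b[0]) * (b[3] - b[1]))
    return intersection / smaller if smaller > 0 else 0.0


def find_overlapping_lines(lines, threshold=0.8):
    """Return pairs of line texts whose boxes overlap by at least `threshold`."""
    rects = [(line["text"], to_rect(line["boundingBox"])) for line in lines]
    return [
        (text_a, text_b)
        for (text_a, rect_a), (text_b, rect_b) in combinations(rects, 2)
        if overlap_ratio(rect_a, rect_b) >= threshold
    ]


# Made-up lines for illustration only; real input would come from the result JSON.
sample_lines = [
    {"text": "Invoice 12345", "boundingBox": [1.0, 1.0, 3.0, 1.0, 3.0, 1.5, 1.0, 1.5]},
    {"text": "12345", "boundingBox": [2.0, 1.1, 2.9, 1.1, 2.9, 1.4, 2.0, 1.4]},
    {"text": "Total", "boundingBox": [1.0, 3.0, 2.0, 3.0, 2.0, 3.5, 1.0, 3.5]},
]

print(find_overlapping_lines(sample_lines))
```

On the made-up data, the second line sits entirely inside the first, so that pair is reported, while "Total" is not. Axis-aligned rectangles are used for simplicity, so this is only approximate for rotated text.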

Any clues as to why this can be?

Upvotes: 1

Views: 422

Answers (1)

Xin Zou

Reputation: 312

Could you share a sample of the PDF file you used? When you use the sample PDF documents, this problem doesn't happen, right? The sample data file can be found here: https://github.com/Azure-Samples/cognitive-services-REST-api-samples/blob/master/curl/form-recognizer/sample_data.zip

Upvotes: 0
