Azure Form Recognizer duplicating text extracted from PDF

Question

While extracting values using Azure Form Recognizer, many values are shown duplicated.

I have trained a custom model labelling the appropriate key values. I find that the OCR duplicates the boxes, so that when I am labelling using the sample labeling tool I often get one box inside the other.I need to pick one and deselect the other, to avoid showing the value duplicated.

When I run the model to predict a new PDF for many keys I also get the values duplicated.

Furthermore, upon inspection of the Result JSON I can see that many Lines have the Bounded Boxes nested, or overlapping. That is, typically you would have a Line that has a bounded box and text associated that in turn have "Words" that have a bounded box inside the Bounded Box of the Line.

Just to clarify, in the JSON I am seeing Lines, that have overlapping or nested Bounded Boxes and therefore text.

Any clues as to why this can be?

Azure Form Recognizer duplicating text extracted from PDF

Answers (1)

Related Questions