Julia Grobman

Reputation: 11

Paragraph numbering in Document AI OCR

Could you possibly help me? I have a PDF in Hebrew that contains numbered paragraphs. After processing this PDF with the Google Document AI OCR API, I receive text in which the paragraph numbers always come before the actual text. [image: example of paragraph numbers appearing before the paragraph text] Is it possible to solve this problem?

I tried examining the line and token layout in the JSON returned by Document AI, but the layout shows the same problem: the numbers are not in the correct place.

```python
# documents: output of the Document AI API (a list of Document objects)
for document in documents:
    for page in document.pages:
        # Only inspect the first 10 pages
        if page.page_number > 10:
            continue
        for line in page.lines:
            # A line's text can span several segments of document.text
            line_text = "".join(
                document.text[int(seg.start_index):int(seg.end_index)]
                for seg in line.layout.text_anchor.text_segments
            )
            print(line_text)
```

I previously tried Google Vision AI and have also tried different documents, and the same error occurred every time.

Thank you!

Upvotes: 1

Views: 244

Answers (1)

Holt Skinner

Reputation: 2234

That's some interesting behavior. Just to clarify, does the text look something like this in the original document? (It would be helpful if you could provide a redacted example document and what you would expect the output to be.)

.10 [hebrew text 1]
.11 [hebrew text 2]
etc.

But the output is like:

.10
.11
[hebrew text 1]
[hebrew text 2]

My hypothesis is that this could be an issue with how Document AI handles this type of input for right-to-left languages (like Hebrew). If that's the case, this can be reported to the product development team. But it will be difficult to tell without an input document and the expected output.

For your specific use case, it could also make sense to use the Form Parser if you're interested in extracting specific fields based on those numbers. Processor version `pretrained-form-parser-v2.0-2022-11-10` added support for all of the languages supported by Document OCR.
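
While the cause is unconfirmed, one possible workaround is to rebuild each visual row from token positions instead of trusting the order of `document.text`. The sketch below is only an illustration, not a Document AI feature: `rebuild_rtl_lines`, the `Token` tuple and the `y_tolerance` value are hypothetical names and defaults, and it assumes you have already reduced each token to its text plus a normalized center point (for example by averaging `token.layout.bounding_poly.normalized_vertices`). It also assumes a single-column page where tokens on the same row have nearly the same vertical position.

```python
from typing import List, Tuple

# Hypothetical input type: (text, x_center, y_center), with coordinates
# normalized to [0, 1]. Not a Document AI type; build it from your own tokens.
Token = Tuple[str, float, float]


def rebuild_rtl_lines(tokens: List[Token], y_tolerance: float = 0.01) -> List[str]:
    """Group tokens into visual rows and order each row right to left."""
    rows: List[List[Token]] = []
    # Walk tokens top to bottom so rows are built in a single pass.
    for token in sorted(tokens, key=lambda t: t[2]):
        # Same row if the vertical center is close to the row's first token.
        if rows and abs(token[2] - rows[-1][0][2]) <= y_tolerance:
            rows[-1].append(token)
        else:
            rows.append([token])

    lines = []
    for row in rows:
        # Larger x first: on a right-to-left page the rightmost token is read first.
        ordered = sorted(row, key=lambda t: t[1], reverse=True)
        lines.append(" ".join(text for text, _, _ in ordered))
    return lines


# Placeholder tokens: numbers at the right margin, text to their left.
example = [
    (".10", 0.90, 0.100),
    (".11", 0.90, 0.150),
    ("A", 0.80, 0.101),
    ("B", 0.70, 0.100),
    ("C", 0.80, 0.151),
]
print(rebuild_rtl_lines(example))  # ['.10 A B', '.11 C']
```

Grouping by vertical position keeps each number with the text on its own row, which is the pairing the question expects. The tolerance would likely need tuning per document, and multi-column layouts or paragraphs that wrap over several rows would need extra handling.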

Upvotes: 0
