Paragraph numbering in Document AI OCR

Question

Could you help me with the following? I have a PDF in Hebrew that contains numbered paragraphs. After processing this PDF with the Google Document AI OCR API, I receive text in which the paragraph numbering always comes before the actual paragraph text.

(Screenshot: an example of paragraph numbering appearing before the paragraph text.)

Is it possible to solve this problem?

I tried examining the line and token layouts in the JSON returned by Document AI, but the layout reflects the same problem: the numbers are not in the correct place.

```python
# documents: output of the Document AI API
for document in documents:
    for page in document.pages:
        # Only inspect the first 10 pages.
        if page.page_number > 10:
            continue
        for line in page.lines:
            # A layout's text anchor can hold several segments; join them all.
            line_text = "".join(
                document.text[int(seg.start_index):int(seg.end_index)]
                for seg in line.layout.text_anchor.text_segments
            )
            print(line_text)
```
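If the numbers come back as separate lines ahead of the paragraph text (an assumption, since the raw JSON is not shown here), one possible workaround is to regroup lines by their vertical position and read each row from right to left. The following is only a minimal sketch of that idea: the `Line` class, `merge_rows_rtl`, `y_tolerance`, and the sample coordinates are all hypothetical, and in real output the coordinates would have to be derived from `line.layout.bounding_poly.normalized_vertices`.

```python
from dataclasses import dataclass


@dataclass
class Line:
    # Hypothetical container; not part of the Document AI client library.
    text: str        # text sliced from document.text
    x_right: float   # largest normalized x of the bounding polygon
    y_center: float  # mean normalized y of the bounding polygon


def merge_rows_rtl(lines, y_tolerance=0.01):
    """Group lines with similar vertical centers, then read each row right to left."""
    rows = []
    for line in sorted(lines, key=lambda l: l.y_center):
        # Compare against the first line of the current row.
        if rows and abs(line.y_center - rows[-1][0].y_center) <= y_tolerance:
            rows[-1].append(line)
        else:
            rows.append([line])
    return [
        " ".join(
            l.text.strip()
            for l in sorted(row, key=lambda l: l.x_right, reverse=True)
        )
        for row in rows
    ]


# Made-up sample: the numbers are emitted first, the paragraph text afterwards.
ocr_lines = [
    Line("1.", 0.95, 0.100),
    Line("2.", 0.95, 0.200),
    Line("first paragraph", 0.90, 0.101),
    Line("second paragraph", 0.90, 0.199),
]
merged = merge_rows_rtl(ocr_lines)
print(merged)
```

This only pairs a number with the first line of its paragraph, and the tolerance would need tuning for the actual line spacing of the document.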

I previously tried Google Vision AI, and I also tried different documents; the same error occurred every time.

Thank you!

