Reputation: 31
I have a PDF that was made searchable via OCR. The OCR introduced systematic errors because the PDF is written in a non-Latin script. I want to extract the hOCR from the PDF, correct it, and then reinsert the fixed OCR back into the PDF. But I keep having trouble extracting the hOCR from the original searchable PDF.
I found very few past discussions on how to do this, and the ones I found didn't work for me. ocrodjvu kept failing because of environment problems.
I'm running Python 3.7 on a Mac.
Update: To make it clearer, here is an example page out of an 800-page PDF. The URL will expire within a year, so here is a simple screenshot. The PDF is an Armenian-Armenian dictionary. The first line reads
ԲԱՌԱԿԱԶՄԸ։ Բառարանի մեջ ընդգրկվել են բոլոր
But if you open the PDF in a browser and copy this line, you get a weird encoding:
´²è²Î²¼ØÀ: ´³é³ñ³ÝÇ Ù»ç Áݹ·ñÏí»É »Ý μáÉáñ
Strangely, if you open it in a PDF reader, you end up getting extra spaces. But that's a separate issue:
´² è² Î²¼ ØÀ: ´³ é³ ñ³ ÝÇ Ù»ç Áݹ·ñÏ í»É »Ý μá Éáñ
By looking at the PDF, I saw that the Armenian letter Ա was systematically encoded as ². My plan was to:
1. extract hOCR from the PDF (with position data)
2. convert the weird symbols into Armenian
3. reinsert the hOCR back into the PDF
A friend of mine gave me Python code that does (1) using fitz:
import fitz

def pdf_to_hocr(pdf_path):
    doc = fitz.open(pdf_path)
    hocr = ""
    for page_num in range(len(doc)):
        print(page_num)
        page = doc.load_page(page_num)
        blocks = page.get_text("dict")["blocks"]
        for b in blocks:  # iterate through the text blocks
            if b["type"] == 0:  # block contains text
                for line in b["lines"]:
                    # get the bbox of the line as integer coordinates
                    bbox = fitz.Rect(line["bbox"]).irect
                    line_text = "".join(span["text"] for span in line["spans"])
                    # format the line as an hOCR ocr_line span
                    hocr_line = (
                        f'<span class="ocr_line" id="line_{page_num}_{bbox}" '
                        f'title="bbox {bbox.x0} {bbox.y0} {bbox.x1} {bbox.y1}">'
                        f'{line_text}</span>\n'
                    )
                    hocr += hocr_line
    return hocr
With a simple driver to write the output:

pdf_path = "originalPDF.pdf"
hocr_output = pdf_to_hocr(pdf_path)
with open("originalHOCR.hocr", "w", encoding="utf-8") as file:
    file.write(hocr_output)
But the problem is that fitz extracts the text with the extra spaces, so the hOCR for the first line shown above is:
<span class="ocr_line" id="line_0_IRect(71, 100, 292, 111)" title="bbox 71 100 292 111">´² è² Î²¼ ØÀ: ´³ é³ ñ³ ÝÇ Ù»ç Áݹ·ñÏ í»É »Ý µá Éáñ</span>
The span has "´² è² Î²¼ ØÀ: ´³ é³ ñ³ ÝÇ Ù»ç Áݹ·ñÏ í»É »Ý µá Éáñ" with the extra spaces. I'm now trying to find code that will generate an hOCR like the above but without the extra spaces.
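One flag worth trying here is PyMuPDF's TEXT_INHIBIT_SPACES, which stops fitz from synthesizing spaces for layout gaps between characters; I'm not sure it helps when the spaces exist as real glyphs in the PDF. A minimal sketch:

import fitz

doc = fitz.open("originalPDF.pdf")
page = doc.load_page(0)
# TEXT_INHIBIT_SPACES stops fitz generating spaces for gaps between
# characters; it cannot remove spaces stored as real glyphs in the PDF.
blocks = page.get_text("dict", flags=fitz.TEXT_INHIBIT_SPACES)["blocks"]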
Update 2: Fixed the file URL
Upvotes: 2
Views: 338
Reputation: 11867
TL;DR: this is a problem about correcting a PDF CMap internally, NOT via OCR. However, without the source PDF, the initial answer focused on OCR methods. So for others with this programming challenge, skip to part 2, or use the following inferior OCR methods.
"hOCR" is HTML from OCR and does not exist inside a PDF it is generated from OCR. Thus it does not make sense to correct a previous poor insertion by extract as plain text, nor try to fix it (but see part 2 of this answer), since many OCR characters may not exist. Here we can see similar to when OCR has replaced each letter (this is normal when a word processing dictionary has not been used correctly) with a variety of mixed fonts.
From high-quality images such as PNG you can get a reasonable output from OCR. Note this file does not have usable OCR, so each page needs converting to an image first (which destroys the source text layer), followed by a dictionary-based OCR in a single font.
OCR from a scanned-image PDF is often not as good as OCR run directly on images, where you control image quality first.
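Since the asker is on a Mac with Python, a minimal sketch of this render-then-OCR route is PyMuPDF plus pytesseract. This assumes a local Tesseract install with hye.traineddata available, a reasonably recent PyMuPDF, and placeholder file names:

import io

import fitz         # PyMuPDF, used here only to rasterize pages
import pytesseract  # needs a local Tesseract install with hye.traineddata
from PIL import Image

doc = fitz.open("armenian.pdf")  # placeholder file name
page = doc.load_page(0)
# Rasterize at roughly 300 dpi; the bad text layer is discarded entirely.
pix = page.get_pixmap(matrix=fitz.Matrix(300 / 72, 300 / 72))
img = Image.open(io.BytesIO(pix.tobytes("png")))
# Generate real hOCR (with bounding boxes) from the clean page image.
hocr = pytesseract.image_to_pdf_or_hocr(img, lang="hye", extension="hocr")
with open("page0.hocr", "wb") as f:
    f.write(hocr)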
Here I am using the MuTool Windows build with Tesseract (I do not know if there is a Mac variant with Tesseract; you would need to see if it is in the .tar.gz sources at https://mupdf.com/releases/index.html ).
Whichever Tesseract version you use, you will likely need the related Armenian data dependency. In this case that is easy:
curl -O https://raw.githubusercontent.com/tesseract-ocr/tessdata/main/hye.traineddata
That should download hye.traineddata at 3509k (3,594,112 bytes); then run Artifex MuTool on a sample. (It can generate a form of hOCR, but avoid that by starting afresh.)
mutool draw -o armenian-out.pdf -t hye -d . -F ocr.pdf armenian.pdf

The result:
ԲԱՌԱԿԱԶՄԸ: Բառարանի մեջ ընդգրկվել են բոլոր
այն բառերը եւ դարբվածները, որոնք գործածական են
Ղարաբաղի բարբառի
խոսվածքներում: Շաղախի,
Շահումյանի
եւ Գորիսի ենթաբարբառների բառա-
պաշարին անդրադարձել ենք մասնակիորեն` հաշվի
առնելով դրանց` բուն Ղարաբաղի բարբառից ունեցած
որոշ էական տարբերությունները, որոնք հիմնակա-
նում այլ բարբառների ազդեցության արդյունք են:
For comparison, here is the result from an image using an online OCR, though you can do it yourself (see the commands below).
ԲԱՌԱԿԱԶՄԸ։ Բառարանի մեջ ընդգրկվել են բոլոր
այն բառերը եւ դարձվածները, որոնք գործածական են
Ղարաբաղի բարբառի խոսվածքներում։ Շաղախի,
Շահումյանի եւ Գորիսի ենթաբարբառների բառա-
պաշարին անդրադարձել ենք մասնակիորեն՝ հաշվի
առնելով դրանց՝ բուն Ղարաբաղի բարբառից ունեցած
որոշ էական տարբերությունները, որոնք հիմնակա
նում այլ բարբառների ազդեցության արդյունք են։
Again using Windows as an example:
set "TESSDATA_PREFIX=%CD%"
Tesseract --list-langs
replies
List of available languages (2):
eng
hye
Tesseract P8ZmT.png outputbase -l hye
Notepad outputbase.txt
Part 2

So from the supplied reproduction we can see the source has no images; it is simply a badly encoded PDF. This again is very common, but it is more likely to be solved by digital rather than analog methods.
The first question when converting a PDF is: give me the source sample, then we can check whether the CMap is present. In this case it appears to be missing, thus we need to use a character substitution that replicates the missing ANSI-to-Unicode conversion, such as here:
Glyphs:   ´    ²    è    ²    Î    ²    ¼    Ø    À    :
ANSI:     B4   B2   E8   B2   CE   B2   BC   D8   C0   3A
Unicode:  0532 0531 054C 0531 053F 0531 0536 0544 0538 003A
Armenian: Բ    Ա    Ռ    Ա    Կ    Ա    Զ    Մ    Ը    ։
For some PDFs in this state it is possible to remap single-byte characters to other values easily, though very slowly (a table of changes as shown above will help).
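As a minimal sketch, the table above becomes a str.translate map in Python; only the characters shown are covered, and the rest of the font's table would have to be completed the same way:

# Partial ANSI-to-Armenian map reconstructed from the table above;
# the remaining letters of the font must be added by hand.
ANSI_TO_ARMENIAN = str.maketrans({
    "\u00B4": "\u0532",  # ´ -> Բ
    "\u00B2": "\u0531",  # ² -> Ա
    "\u00E8": "\u054C",  # è -> Ռ
    "\u00CE": "\u053F",  # Î -> Կ
    "\u00BC": "\u0536",  # ¼ -> Զ
    "\u00D8": "\u0544",  # Ø -> Մ
    "\u00C0": "\u0538",  # À -> Ը
})

print("´²è²Î²¼ØÀ:".translate(ANSI_TO_ARMENIAN))  # ԲԱՌԱԿԱԶՄԸ: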
Comparing the original against a test correction made by remapping the bold font, we see that the first character (leftmost in this table) still needs conversion from B4 to 0532. Note this does not resolve the spacing issue.
The first hurdle is that this is not easy to retrofit into an existing PDF. Few PDF tools will attempt such a large task, especially when there are usually only 256 ANSI character codes to map into a Unicode range of up to 65,536 characters.
I therefore have to suggest that the best route is probably PyMuPDF, which has in the past proved very capable of such a task. However, it is too big an issue to describe fully here, so the best answer is to discuss the bigger CMap issue with @JorjMcKie via https://discord.gg/63t35Smg
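To give a rough idea of the PyMuPDF direction, here is a sketch of one workaround: overlaying corrected, invisible text on each line. This is not the proper internal CMap repair; the function name and font file are my assumptions, it reuses the ANSI_TO_ARMENIAN table from the sketch above, and the bad original text layer would remain in the file:

import fitz  # PyMuPDF

def overlay_fixed_text(src, dst, fontfile):
    # Sketch: write corrected, invisible text over each extracted line.
    # fontfile must be a TTF with Armenian coverage (e.g. Noto Sans Armenian).
    doc = fitz.open(src)
    for page in doc:
        for b in page.get_text("dict")["blocks"]:
            if b["type"] != 0:  # skip image blocks
                continue
            for line in b["lines"]:
                text = "".join(s["text"] for s in line["spans"])
                fixed = text.translate(ANSI_TO_ARMENIAN)
                origin = fitz.Rect(line["bbox"]).bl  # bottom-left of the line
                # render_mode=3 makes the text invisible but selectable
                page.insert_text(origin, fixed, fontname="armn",
                                 fontfile=fontfile, fontsize=8, render_mode=3)
    doc.save(dst)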
Upvotes: 1