Reputation: 31
I have a PDF that was made searchable via OCR. The OCR introduced systematic errors because the PDF is written in a non-Latin script. I want to extract the hOCR from the PDF, correct it, and then reinsert the fixed OCR back into the PDF. But I keep having trouble extracting the hOCR from the original searchable PDF.
I found very few past discussions on how to do this, and the ones I found didn't work for me. ocrodjvu kept failing because of environment problems.
I'm running Python 3.7 on a Mac.
Update: To make it clearer, here is an example page out of an 800-page PDF. The URL will expire within a year, so here is a simple screenshot. The PDF is an Armenian-Armenian dictionary. The first line reads
ԲԱՌԱԿԱԶՄԸ։ Բառարանի մեջ ընդգրկվել են բոլոր
But if you open the PDF in a browser and copy this line, you get a weird encoding:
´²è²Î²¼ØÀ: ´³é³ñ³ÝÇ Ù»ç Áݹ·ñÏí»É »Ý μáÉáñ
Strangely, if you open it in a PDF reader, you end up getting extra spaces. But that's a separate issue:
´² è² Î²¼ ØÀ: ´³ é³ ñ³ ÝÇ Ù»ç Áݹ·ñÏ í»É »Ý μá Éáñ
By looking at the PDF, I saw that the Armenian letter Ա was systematically encoded as ². My plan was to:
1. extract hOCR from the PDF (with position data)
2. convert the weird symbols into Armenian
3. reinsert the hOCR back into the PDF
A friend of mine gave me Python code that does (1) using fitz:
import fitz

def pdf_to_hocr(pdf_path):
    doc = fitz.open(pdf_path)
    hocr = ""
    for page_num in range(len(doc)):
        print(page_num)
        page = doc.load_page(page_num)
        blocks = page.get_text("dict")["blocks"]
        for b in blocks:  # iterate through the text blocks
            if b["type"] == 0:  # block contains text
                for line in b["lines"]:
                    # get the bbox of the line as integer coordinates
                    bbox = fitz.Rect(line["bbox"]).irect
                    line_text = "".join(span["text"] for span in line["spans"])
                    # format the line as an hOCR ocr_line span
                    hocr_line = (
                        f'<span class="ocr_line" id="line_{page_num}_{bbox}" '
                        f'title="bbox {bbox.x0} {bbox.y0} {bbox.x1} {bbox.y1}">'
                        f'{line_text}</span>\n'
                    )
                    hocr += hocr_line
    return hocr
With a simple driver to write the output:

pdf_path = "originalPDF.pdf"
hocr_output = pdf_to_hocr(pdf_path)
with open("originalHOCR.hocr", "w", encoding="utf-8") as file:
    file.write(hocr_output)
But the problem is that fitz extracts the text with the extra spaces, so the hOCR for the first line shown above is:
<span class="ocr_line" id="line_0_IRect(71, 100, 292, 111)" title="bbox 71 100 292 111">´² è² Î²¼ ØÀ: ´³ é³ ñ³ ÝÇ Ù»ç Áݹ·ñÏ í»É »Ý µá Éáñ</span>
The span has "´² è² Î²¼ ØÀ: ´³ é³ ñ³ ÝÇ Ù»ç Áݹ·ñÏ í»É »Ý µá Éáñ" with the extra spaces. I'm now trying to find code that will generate an hOCR like the above but without the extra spaces.
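One flag worth trying here is PyMuPDF's TEXT_INHIBIT_SPACES, which stops fitz from synthesizing spaces for layout gaps between characters; I'm not sure it helps when the spaces exist as real glyphs in the PDF. A minimal sketch:

import fitz

doc = fitz.open("originalPDF.pdf")
page = doc.load_page(0)
# TEXT_INHIBIT_SPACES stops fitz generating spaces for gaps between
# characters; it cannot remove spaces stored as real glyphs in the PDF.
blocks = page.get_text("dict", flags=fitz.TEXT_INHIBIT_SPACES)["blocks"]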
Update 2: Fixed the file URL
Upvotes: 2
Views: 338
Reputation: 11867
TL;DR: this is a problem about correcting a PDF CMap internally, NOT via OCR. However, without the source PDF, the initial answer focused on OCR methods. So for others with this programming challenge, skip to part 2, or use the following inferior OCR methods.
"hOCR" is HTML from OCR and does not exist inside a PDF it is generated from OCR. Thus it does not make sense to correct a previous poor insertion by extract as plain text, nor try to fix it (but see part 2 of this answer), since many OCR characters may not exist. Here we can see similar to when OCR has replaced each letter (this is normal when a word processing dictionary has not been used correctly) with a variety of mixed fonts.
From high-quality images such as PNG you can get a reasonable output from OCR. Note this file does not have usable OCR, so each page needs converting to an image first (which destroys the source text layer), followed by a dictionary-based OCR in a single font.
OCR from a scanned-image PDF is often not as good as OCR run directly on images, where you control image quality first.
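Since the asker is on a Mac with Python, a minimal sketch of this render-then-OCR route is PyMuPDF plus pytesseract. This assumes a local Tesseract install with hye.traineddata available, a reasonably recent PyMuPDF, and placeholder file names:

import io

import fitz         # PyMuPDF, used here only to rasterize pages
import pytesseract  # needs a local Tesseract install with hye.traineddata
from PIL import Image

doc = fitz.open("armenian.pdf")  # placeholder file name
page = doc.load_page(0)
# Rasterize at roughly 300 dpi; the bad text layer is discarded entirely.
pix = page.get_pixmap(matrix=fitz.Matrix(300 / 72, 300 / 72))
img = Image.open(io.BytesIO(pix.tobytes("png")))
# Generate real hOCR (with bounding boxes) from the clean page image.
hocr = pytesseract.image_to_pdf_or_hocr(img, lang="hye", extension="hocr")
with open("page0.hocr", "wb") as f:
    f.write(hocr)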
Here I am using the MuTool Windows build with Tesseract (I do not know if there is a Mac variant with Tesseract; you would need to see if it is in the .tar.gz sources at https://mupdf.com/releases/index.html ).
Whichever Tesseract version you use, you will likely need the related Armenian data dependency. In this case that is easy:
curl -O https://raw.githubusercontent.com/tesseract-ocr/tessdata/main/hye.traineddata
That should download hye.traineddata at 3509k (3,594,112 bytes); then run Artifex MuTool on a sample. (It can generate a form of hOCR, but avoid that by starting afresh.)
mutool draw -o armenian-out.pdf -t hye -d . -F ocr.pdf armenian.pdf

The result:
ԲԱՌԱԿԱԶՄԸ: Բառարանի մեջ ընդգրկվել են բոլոր
այն բառերը եւ դարբվածները, որոնք գործածական են
Ղարաբաղի բարբառի
խոսվածքներում: Շաղախի,
Շահումյանի
եւ Գորիսի ենթաբարբառների բառա-
պաշարին անդրադարձել ենք մասնակիորեն` հաշվի
առնելով դրանց` բուն Ղարաբաղի բարբառից ունեցած
որոշ էական տարբերությունները, որոնք հիմնակա-
նում այլ բարբառների ազդեցության արդյունք են:
For comparison, here is the result from an image using an online OCR, though you can do it yourself (see the commands below).
ԲԱՌԱԿԱԶՄԸ։ Բառարանի մեջ ընդգրկվել են բոլոր
այն բառերը եւ դարձվածները, որոնք գործածական են
Ղարաբաղի բարբառի խոսվածքներում։ Շաղախի,
Շահումյանի եւ Գորիսի ենթաբարբառների բառա-
պաշարին անդրադարձել ենք մասնակիորեն՝ հաշվի
առնելով դրանց՝ բուն Ղարաբաղի բարբառից ունեցած
որոշ էական տարբերությունները, որոնք հիմնակա
նում այլ բարբառների ազդեցության արդյունք են։
Again using Windows as an example:
set "TESSDATA_PREFIX=%CD%"
Tesseract --list-langs
replies
List of available languages (2):
eng
hye
Tesseract P8ZmT.png outputbase -l hye
Notepad outputbase.txt
Part 2

So from the supplied reproduction we can see the source has no images; it is simply a badly encoded PDF. This again is very common, but it is more likely to be solved by digital rather than analog methods.
The first question when converting a PDF is: give me the source sample, then we can check whether the CMap is present. In this case it appears to be missing, thus we need to use a character substitution that replicates the missing ANSI-to-Unicode conversion, such as here:
Glyphs:   ´    ²    è    ²    Î    ²    ¼    Ø    À    :
ANSI:     B4   B2   E8   B2   CE   B2   BC   D8   C0   3A
Unicode:  0532 0531 054C 0531 053F 0531 0536 0544 0538 003A
Armenian: Բ    Ա    Ռ    Ա    Կ    Ա    Զ    Մ    Ը    ։
For some PDFs in this state it is possible to remap single-byte characters to other values easily, though very slowly (a table of changes as shown above will help).
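As a minimal sketch, the table above becomes a str.translate map in Python; only the characters shown are covered, and the rest of the font's table would have to be completed the same way:

# Partial ANSI-to-Armenian map reconstructed from the table above;
# the remaining letters of the font must be added by hand.
ANSI_TO_ARMENIAN = str.maketrans({
    "\u00B4": "\u0532",  # ´ -> Բ
    "\u00B2": "\u0531",  # ² -> Ա
    "\u00E8": "\u054C",  # è -> Ռ
    "\u00CE": "\u053F",  # Î -> Կ
    "\u00BC": "\u0536",  # ¼ -> Զ
    "\u00D8": "\u0544",  # Ø -> Մ
    "\u00C0": "\u0538",  # À -> Ը
})

print("´²è²Î²¼ØÀ:".translate(ANSI_TO_ARMENIAN))  # ԲԱՌԱԿԱԶՄԸ: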
Comparing the original against a test correction made by remapping the bold font, we see that the first character (leftmost in this table) still needs conversion from B4 to 0532. Note this does not resolve the spacing issue.
The first hurdle is that this is not easy to retrofit into an existing PDF. Few PDF tools will attempt such a large task, especially when there are usually only 256 ANSI character codes to map into a Unicode range of up to 65,536 characters.
I therefore have to suggest that the best route is probably PyMuPDF, which has in the past proved very capable of such a task. However, it is too big an issue to describe fully here, so the best answer is to discuss the bigger CMap issue with @JorjMcKie via https://discord.gg/63t35Smg
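To give a rough idea of the PyMuPDF direction, here is a sketch of one workaround: overlaying corrected, invisible text on each line. This is not the proper internal CMap repair; the function name and font file are my assumptions, it reuses the ANSI_TO_ARMENIAN table from the sketch above, and the bad original text layer would remain in the file:

import fitz  # PyMuPDF

def overlay_fixed_text(src, dst, fontfile):
    # Sketch: write corrected, invisible text over each extracted line.
    # fontfile must be a TTF with Armenian coverage (e.g. Noto Sans Armenian).
    doc = fitz.open(src)
    for page in doc:
        for b in page.get_text("dict")["blocks"]:
            if b["type"] != 0:  # skip image blocks
                continue
            for line in b["lines"]:
                text = "".join(s["text"] for s in line["spans"])
                fixed = text.translate(ANSI_TO_ARMENIAN)
                origin = fitz.Rect(line["bbox"]).bl  # bottom-left of the line
                # render_mode=3 makes the text invisible but selectable
                page.insert_text(origin, fixed, fontname="armn",
                                 fontfile=fontfile, fontsize=8, render_mode=3)
    doc.save(dst)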
Upvotes: 1