Reputation: 21
I am working with pytesseract and would like to convert hOCR output to a plain string. pytesseract already implements such a function, of course, but I would like to know what other strategies are possible.
from pytesseract import image_to_pdf_or_hocr
hocr_output = image_to_pdf_or_hocr(image, extension='hocr')
Upvotes: 2
Views: 1584
Reputation: 156
hOCR embeds the OCR results in XHTML, so Tesseract's output can be read with an XML parser.
image_to_pdf_or_hocr returns bytes, so we first decode the output to str:
from pytesseract import image_to_pdf_or_hocr
hocr_output = image_to_pdf_or_hocr(image, extension='hocr')
hocr = hocr_output.decode('utf-8')
Now we can use xml.etree to parse it:
import xml.etree.ElementTree as ET
root = ET.fromstring(hocr)
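Once the tree is parsed, it can also be walked selectively instead of flattened. The following is a minimal sketch, not a definitive implementation: it assumes the layout Tesseract normally emits, where each text line is a span with class ocr_line and each word a span with class ocrx_word, and it assumes the class attribute holds exactly one of those values. SAMPLE_HOCR is a hand-written stand-in for real output and hocr_lines is a hypothetical helper name. Other line-level classes are ignored here.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for Tesseract's hOCR output (assumption: real output
# has the same element and class structure, plus bounding-box attributes).
SAMPLE_HOCR = """<html xmlns="http://www.w3.org/1999/xhtml">
 <head><title></title></head>
 <body>
  <div class='ocr_page' id='page_1'>
   <span class='ocr_line' id='line_1_1'>
    <span class='ocrx_word' id='word_1_1'>Hello</span>
    <span class='ocrx_word' id='word_1_2'>world</span>
   </span>
   <span class='ocr_line' id='line_1_2'>
    <span class='ocrx_word' id='word_1_3'>second</span>
    <span class='ocrx_word' id='word_1_4'>line</span>
   </span>
  </div>
 </body>
</html>"""

# ElementTree prefixes every tag with its namespace in braces.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def hocr_lines(root):
    """Return one string per ocr_line element, words joined by single spaces."""
    lines = []
    for line in root.iter(XHTML_NS + "span"):
        if line.get("class") != "ocr_line":
            continue
        # Collect the text of every word span nested inside this line.
        words = [
            "".join(word.itertext()).strip()
            for word in line.iter(XHTML_NS + "span")
            if word.get("class") == "ocrx_word"
        ]
        lines.append(" ".join(w for w in words if w))
    return lines


root = ET.fromstring(SAMPLE_HOCR)
print("\n".join(hocr_lines(root)))
```

Walking by class keeps the line breaks of the original page, which a flat join of all text nodes does not guarantee.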
Element.itertext() iterates over every piece of text in the tree, and we can join the results into a single string:
text = ''.join(root.itertext())
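Note that the joined string also contains the indentation and newlines between elements, plus any text inside head. A minimal sketch of one way to tidy that up follows. It assumes the XHTML namespace that Tesseract normally declares, SAMPLE_HOCR is a hand-written stand-in for real output, and hocr_to_text is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for Tesseract's hOCR output (assumption).
SAMPLE_HOCR = """<html xmlns="http://www.w3.org/1999/xhtml">
 <head><title>scan</title></head>
 <body>
  <div class='ocr_page'>
   <span class='ocr_line'>
    <span class='ocrx_word'>Hello</span>
    <span class='ocrx_word'>world</span>
   </span>
  </div>
 </body>
</html>"""


def hocr_to_text(hocr):
    """Flatten hOCR markup to a single whitespace-normalized string."""
    root = ET.fromstring(hocr)
    # Restrict to <body> so that text in <head> (e.g. a title) is left out.
    body = root.find("{http://www.w3.org/1999/xhtml}body")
    source = body if body is not None else root
    # str.split() with no arguments splits on any run of whitespace and
    # drops empty pieces, so re-joining collapses indentation and newlines.
    return " ".join("".join(source.itertext()).split())


raw = "".join(ET.fromstring(SAMPLE_HOCR).itertext())
clean = hocr_to_text(SAMPLE_HOCR)
print(repr(raw))    # still contains the title text and layout whitespace
print(repr(clean))
```

This collapses everything onto one line. If line breaks matter, the words have to be grouped per line element instead.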
Upvotes: 1