Reputation: 109
I need to convert a scanned PDF to a DOCX document. The approach I have used so far:
1. Convert the scanned PDF to a searchable PDF using pytesseract: pytesseract.image_to_pdf_or_hocr()
2. Convert that searchable PDF to DOCX using lowriter: 'lowriter --invisible --convert-to docx "{}"'
However, this causes formatting and layout problems in the resulting DOCX/DOC file, and text and images overlap each other. Please help.
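For reference, the two steps described above might look roughly like the sketch below. It assumes each scanned page is already available as an image file (pytesseract works on images, not on PDFs directly). The file names and the helper names ocr_image_to_pdf, lowriter_command and pdf_to_docx are made up for illustration; the command itself is the one quoted in the question.

```python
import subprocess


def ocr_image_to_pdf(image_path, pdf_path):
    """OCR one page image and write it out as a searchable PDF."""
    # Third-party import kept local so the module loads without pytesseract.
    import pytesseract

    # image_to_pdf_or_hocr returns the PDF content as bytes.
    pdf_bytes = pytesseract.image_to_pdf_or_hocr(image_path, extension='pdf')
    with open(pdf_path, 'wb') as f:
        f.write(pdf_bytes)


def lowriter_command(pdf_path):
    """Build the LibreOffice Writer command quoted in the question."""
    return ['lowriter', '--invisible', '--convert-to', 'docx', pdf_path]


def pdf_to_docx(pdf_path):
    """Run the conversion; raises CalledProcessError if lowriter fails."""
    subprocess.run(lowriter_command(pdf_path), check=True)


# Hypothetical usage (file names are placeholders):
# ocr_image_to_pdf('page1.png', 'page1.pdf')
# pdf_to_docx('page1.pdf')
```

Passing the command as a list avoids shell quoting problems when the file name contains spaces.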
Upvotes: 1
Views: 2781
Reputation: 324
You can use Python's pdfminer to convert your PDF to text. This is lighter on memory than tesseract; it picks up all the text data but loses the formatting. You can then convert the resulting text file to DOCX using python-docx:
from docx import Document
import re
import os

path = 'your path'  # directory containing the extracted .txt files
direct = os.listdir(path)
for i in direct:
    document = Document()
    document.add_heading(i, 0)  # use the file name as the document title
    with open(os.path.join(path, i)) as f:
        myfile = f.read()
    # replace runs of non-ASCII characters and form feeds (page breaks) with a space
    myfile = re.sub(r'[^\x00-\x7F]+|\x0c', ' ', myfile)
    p = document.add_paragraph(myfile)
    document.save('/path/to/write/to/' + i + '.docx')
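The snippet above starts from .txt files, so the PDF-to-text step has to happen first. A minimal sketch of that step, assuming the pdfminer.six package and a PDF that already has a text layer (pdfminer reads embedded text and does no OCR, so a purely scanned PDF would need the OCR pass first). The helper names and directory arguments are hypothetical:

```python
import os


def txt_name_for(pdf_name):
    """Map 'report.pdf' to 'report.txt'."""
    base, _ext = os.path.splitext(pdf_name)
    return base + '.txt'


def pdf_dir_to_txt(pdf_dir, txt_dir):
    """Extract the text of every PDF in pdf_dir into a .txt file in txt_dir."""
    # Third-party import kept local so the module loads without pdfminer.six.
    from pdfminer.high_level import extract_text

    os.makedirs(txt_dir, exist_ok=True)
    for name in os.listdir(pdf_dir):
        if not name.lower().endswith('.pdf'):
            continue  # skip anything that is not a PDF
        text = extract_text(os.path.join(pdf_dir, name))
        out_path = os.path.join(txt_dir, txt_name_for(name))
        with open(out_path, 'w', encoding='utf-8') as f:
            f.write(text)


# Hypothetical usage (directories are placeholders):
# pdf_dir_to_txt('searchable_pdfs', 'extracted_txt')
```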
Alternatively, you could convert the document to XML and read it that way; you can probably preserve some of the formatting by comparing the font sizes.
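One way to act on the font-size idea without going through XML, sketched here under assumptions: pdfminer.six exposes a per-character size through its layout objects, so lines noticeably larger than the most common size can be treated as headings. The 1.2 ratio, the helper names and the heading level are arbitrary choices for illustration, not something pdfminer or python-docx prescribes.

```python
from collections import Counter


def lines_with_sizes(pdf_path):
    """Yield (text, font_size) for each text line in the PDF."""
    # Third-party imports kept local so the module loads without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page in extract_pages(pdf_path):
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                sizes = [ch.size for ch in line if isinstance(ch, LTChar)]
                text = line.get_text().strip()
                if sizes and text:
                    # Use the largest glyph on the line as the line's size.
                    yield text, round(max(sizes), 1)


def mark_headings(lines, ratio=1.2):
    """Return (text, is_heading) pairs; body size is the most common size."""
    lines = list(lines)
    if not lines:
        return []
    body_size = Counter(size for _, size in lines).most_common(1)[0][0]
    return [(text, size >= body_size * ratio) for text, size in lines]


def write_docx(pdf_path, docx_path):
    """Write headings and paragraphs to a .docx file with python-docx."""
    from docx import Document

    document = Document()
    for text, is_heading in mark_headings(lines_with_sizes(pdf_path)):
        if is_heading:
            document.add_heading(text, level=1)
        else:
            document.add_paragraph(text)
    document.save(docx_path)
```

This keeps only a heading/body distinction; images, columns and tables would still be lost.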
GroupDocs.Conversion Cloud offers a Python SDK for converting Text/PDF to DOC/DOCX, and between many other common file formats, without depending on any third-party tool or software.
Upvotes: 1