Yashu Gupta
Yashu Gupta

Reputation: 109

Convert Scanned PDF or tessaract searchable PDF to docx/doc and maintaing all formats and layouts using python

Need to convert the scanned pdf to docx document .The approach I have used so far 1.Converting the scanned pdf to Searchable pdf using pytessaract pytesseract.image_to_pdf_or_hocr() 2. then convert that searchable pdf to docx using lowriter 'lowriter --invisible --convert-to docx"{}"

But this results to formatting and layout issue in the docx /doc and there is a overlapping of text and image in resultant docx file. Please help

Upvotes: 1

Views: 2781

Answers (1)

Akash
Akash

Reputation: 324

you can use pythons, pdfminer to convert your pdf to txt, this will be better than tesseract in terms of memory, it takes in al text data, but loses formatting, you can then convert this txt file to Docx using python-Docx

from docx import Document
import re
import os

path = 'your path'
direct = os.listdir(path)

for i in direct:
    document = Document()
    document.add_heading(i, 0)
    myfile = open('/path/to/read/from/'+i).read()
    myfile = re.sub(r'[^\x00-\x7F]+|\x0c',' ', myfile) # remove all non-XML-compatible 
  characters
    p = document.add_paragraph(myfile)
    document.save('/path/to/write/to/'+i+'.docx')

or maybe you can convert the document to XML and read that way, you probably can sav the formatting to by comparing the font sizes,

GroupDocs.Conversion Cloud, offers Python SDK for Text/PDF to DOC/DOCX conversion and many other common files formats from one format to another, without depending on any third-party tool or software.

Upvotes: 1

Related Questions