Reputation: 119
I am writing a Python program that reads a PDF document, extracts text from it, and renames the document using the extracted text. The scanned PDF is not searchable to begin with, so I would like to convert it into a searchable PDF in Python instead of using Google Docs or Cisdem PDF Converter.
I have read about the ocrmypdf module, which can be used to solve this. However, I do not know how to write the code due to my limited knowledge.
The expected output is a searchable version of the scanned PDF.
Upvotes: 5
Views: 13947
Reputation: 145
I suggest you go through the tutorial; it will take some time, but it should be worth it.
I'm not sure exactly what you want. In my project, the settings below work fine in most cases.
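    import ocrmypdf  # Tesseract is a system program, not a Python import

    def ocr(file_path, save_path):
        # Tesseract language codes have three letters ("eng", not "en").
        # remove_background is disabled in some newer OCRmyPDF releases; drop it if it raises.
        ocrmypdf.ocr(file_path, save_path, rotate_pages=True,
                     remove_background=True, language=["eng"], deskew=True, force_ocr=True)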
Upvotes: 6
Reputation: 25
This can be done in two steps: define a small wrapper around ocrmypdf.ocr, then call it with the input and output paths.
    import ocrmypdf

    def ocr(file_path, save_path):
        ocrmypdf.ocr(file_path, save_path)

    ocr("input.pdf", "output.pdf")
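To cover the renaming part of the question as well, here is a minimal sketch that runs OCR and then names the output after the first line of recognised text. The function names `filename_from_text` and `ocr_and_rename`, the temporary file names, and the "first non-empty line" naming rule are all assumptions made for illustration; the only ocrmypdf feature relied on is the `sidecar` argument, which writes the OCR text to a separate file.

```python
import re
from pathlib import Path


def filename_from_text(text: str, max_len: int = 60) -> str:
    """Build a filesystem-safe PDF name from the first non-empty line of text.

    Hypothetical helper: the naming rule is an assumption, adapt it as needed.
    """
    first_line = next((ln.strip() for ln in text.splitlines() if ln.strip()), "")
    # Keep word characters and hyphens; collapse every other run into "_".
    stem = re.sub(r"[^\w\-]+", "_", first_line).strip("_")[:max_len].rstrip("_")
    return (stem or "untitled") + ".pdf"


def ocr_and_rename(scanned_pdf: str, out_dir: str) -> Path:
    """OCR a scanned PDF, then rename the result after its first line of text."""
    import ocrmypdf  # imported here so filename_from_text works without it

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    tmp_pdf = out / "ocr_tmp.pdf"
    sidecar = out / "ocr_tmp.txt"

    # sidecar= asks OCRmyPDF to also write the recognised text to a text file.
    ocrmypdf.ocr(scanned_pdf, tmp_pdf, sidecar=sidecar)

    target = out / filename_from_text(sidecar.read_text(encoding="utf-8"))
    sidecar.unlink()  # the text file is no longer needed
    return tmp_pdf.rename(target)
```

Calling `ocr_and_rename("input.pdf", "renamed")` would leave a single searchable PDF in the `renamed` folder. The sketch does not check whether a file with the target name already exists, so add that check before using it on a whole folder.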
Upvotes: 0
Reputation: 1
I faced the same issue with scanned PDF files and found that a few lines of code handle it. This code converts a scanned PDF document into a searchable PDF with selectable text.
    import ocrmypdf

    def scannedPdfConverter(file_path, save_path):
        # skip_text=True leaves pages that already contain text untouched
        ocrmypdf.ocr(file_path, save_path, skip_text=True)
        print('File converted successfully!')
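The answers in this thread differ mainly in how they treat pages that already contain text: one uses `force_ocr`, one uses the default, and this one uses `skip_text`. As far as I know these options (together with `redo_ocr`) are mutually exclusive, so only one should be passed at a time. The sketch below makes that choice explicit; the mode names and both function names are hypothetical, only the keyword arguments come from ocrmypdf.

```python
def text_handling_kwargs(mode: str = "error") -> dict:
    """Map a hypothetical mode name to the matching ocrmypdf.ocr keyword."""
    modes = {
        "error": {},                   # default: OCRmyPDF stops if a page already has text
        "skip": {"skip_text": True},   # leave pages with text alone, OCR the rest
        "force": {"force_ocr": True},  # rasterise every page and OCR it again
        "redo": {"redo_ocr": True},    # replace an earlier OCR layer, keep real text
    }
    if mode not in modes:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(modes)}")
    return dict(modes[mode])  # return a copy so callers can add more options


def make_searchable(file_path: str, save_path: str, mode: str = "skip") -> None:
    """Run OCR with exactly one text-handling option selected."""
    import ocrmypdf  # imported here so the helper above works without it

    ocrmypdf.ocr(file_path, save_path, **text_handling_kwargs(mode))
```

For a folder that mixes scanned and already-searchable files, `make_searchable("input.pdf", "output.pdf", mode="skip")` is usually the gentlest choice, since it avoids both the error of the default mode and the quality loss of re-rasterising pages with `force_ocr`.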
Upvotes: 0