ECHO

Reputation: 119

How do I convert a scanned PDF into a searchable PDF in Python (Mac)? e.g. OCRMYPDF module

I am writing a Python program that reads a PDF document, extracts text from it, and renames the file using the extracted text. The scanned PDF is not searchable to begin with, so I would like to convert it into a searchable PDF in Python instead of using Google Docs or Cisdem PDF Converter.

I have read that the ocrmypdf module can be used to solve this. However, I do not know how to write the code due to my limited knowledge.

The expected output is a searchable version of the scanned PDF.

Upvotes: 5

Views: 13947

Answers (3)

Hirschdude

Reputation: 145

I suggest you go through the tutorial; it will take some time, but it should be worth it.

I'm not sure exactly what you want. In my project, the settings below work fine in most cases.

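import ocrmypdf  # Tesseract is an external program that OCRmyPDF calls; it is not imported

def ocr(file_path, save_path):
    # Tesseract uses three-letter language codes, so English is "eng", not "en".
    # Note: remove_background may be unavailable in newer OCRmyPDF releases.
    ocrmypdf.ocr(file_path, save_path, rotate_pages=True,
                 remove_background=True, language="eng", deskew=True,
                 force_ocr=True)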

Upvotes: 6

Moses Noel

Reputation: 25

This can be done in two steps:

  1. Create a Python OCR function.
import ocrmypdf

def ocr(file_path, save_path):
    ocrmypdf.ocr(file_path, save_path)
  2. Call the function.
ocr("input.pdf", "output.pdf")
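
Since the question also mentions renaming the file from the extracted text, here is a rough sketch of how that could be combined with the OCR step. It assumes the sidecar option of ocrmypdf.ocr, which writes the recognized text to a separate file. The helper names title_from_text and ocr_and_rename, the temporary file names, and the "first non-empty line" naming rule are all made up for illustration and not part of OCRmyPDF.

```python
import re
from pathlib import Path


def title_from_text(text, max_len=60):
    """Build a filesystem-safe name from the first usable line of text."""
    for line in text.splitlines():
        line = line.strip()
        if line:
            # Keep word characters, spaces and dashes; drop everything else.
            cleaned = re.sub(r"[^\w\- ]", "", line)
            cleaned = re.sub(r"\s+", "_", cleaned.strip())
            if cleaned:
                return cleaned[:max_len]
    return "untitled"


def ocr_and_rename(input_pdf, output_dir):
    """OCR input_pdf, then name the result after its first recognized line."""
    import ocrmypdf  # imported here so title_from_text works without it

    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    temp_pdf = output_dir / "ocr_tmp.pdf"   # hypothetical temporary names
    sidecar = output_dir / "ocr_tmp.txt"

    # sidecar= asks OCRmyPDF to also write the recognized text to a file.
    ocrmypdf.ocr(input_pdf, temp_pdf, sidecar=sidecar)

    text = sidecar.read_text(encoding="utf-8", errors="replace")
    final_pdf = output_dir / (title_from_text(text) + ".pdf")
    temp_pdf.rename(final_pdf)
    sidecar.unlink()
    return final_pdf
```

Calling ocr_and_rename("input.pdf", "renamed") would then leave a single searchable PDF in the renamed folder. A real program would probably want a smarter rule than "first line" and a check for name collisions.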

Upvotes: 0

Ajay Verma

Reputation: 1

I have also faced the same issue with scanned PDF files, and I found a solution with the few lines of code below. This code converts a scanned PDF document into a searchable PDF in which the text can be selected.

import ocrmypdf

def scannedPdfConverter(file_path, save_path):
    # skip_text=True leaves pages that already contain text untouched
    ocrmypdf.ocr(file_path, save_path, skip_text=True)
    print('File converted successfully!')
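
The answers here use different flags for PDFs that may already contain text: force_ocr in one, skip_text in this one. A small sketch of choosing between them is below. It also lists redo_ocr, a third option in the ocrmypdf.ocr API; as far as I know, only one of the three should be passed at a time. The names ocr_mode_options and convert and the mode strings are invented for this example.

```python
def ocr_mode_options(mode="skip"):
    """Map a mode name to the OCRmyPDF keyword that handles existing text."""
    modes = {
        "skip": {"skip_text": True},   # leave pages that already have text alone
        "force": {"force_ocr": True},  # rasterize every page and OCR it again
        "redo": {"redo_ocr": True},    # replace an earlier OCR layer
    }
    if mode not in modes:
        raise ValueError(f"unknown mode: {mode!r}")
    return modes[mode]


def convert(file_path, save_path, mode="skip"):
    """Run OCRmyPDF with exactly one of the existing-text options."""
    import ocrmypdf  # imported lazily so ocr_mode_options works on its own

    ocrmypdf.ocr(file_path, save_path, **ocr_mode_options(mode))
```

For purely scanned files the choice hardly matters; it starts to matter once some pages already carry a text layer, where "skip" is usually the gentlest option.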

Upvotes: 0
