No output for OCRmyPDF

Question

I'm using OCRmyPDF to extract text form scanned pdf files. I use codes from this Colab notebook for that purpose. The only difference is that instead of downloading the pdf file from an online url, I use the pdf file stored on my local machine (replaced it {file_name} instead of {invoice_pdf}). Everything looks fine up to the point I run:

os.system(f'ocrmypdf {file_name} output.pdf')

Instead of 0, I get 512! and the next line, when I run !ocrmypdf Performance Evaluations.pdf output.pdf , I get an unrecognized error message which reads like:

usage: ocrmypdf [-h] [-l LANGUAGE] [--image-dpi DPI]
                [--output-type {pdfa,pdf,pdfa-1,pdfa-2}] [--sidecar [FILE]]
                [--version] [-j N] [-q] [-v [VERBOSE]] [--title TITLE]
                [--author AUTHOR] [--subject SUBJECT] [--keywords KEYWORDS]
                [-r] [--remove-background] [-d] [-c] [-i] [--oversample DPI]
                [-f] [-s] [--skip-big MPixels] [--max-image-mpixels MPixels]
                [--tesseract-config CFG] [--tesseract-pagesegmode PSM]
                [--tesseract-oem MODE]
                [--pdf-renderer {auto,tesseract,hocr,sandwich}]
                [--tesseract-timeout SECONDS]
                [--rotate-pages-threshold CONFIDENCE]
                [--pdfa-image-compression {auto,jpeg,lossless}]
                [--user-words FILE] [--user-patterns FILE] [--skip-repair]
                [-k] [-g] [--flowchart FLOWCHART]
                input_pdf_or_image output_pdf
ocrmypdf: error: unrecognized arguments: output.pdf

Finally, running the following line:

with pdfplumber.open('output.pdf') as pdf:
    page = pdf.pages[0]
    text = page.extract_text(x_tolerance=2)
    print(text)

returns

FileNotFoundError                         Traceback (most recent call last)
 in ()
----> 1 with pdfplumber.open('output.pdf') as pdf:
      2     page = pdf.pages[0]
      3     text = page.extract_text(x_tolerance=2)
      4     print(text)

/usr/local/lib/python3.6/dist-packages/pdfplumber/pdf.py in open(cls, path_or_fp, **kwargs)
     56     def open(cls, path_or_fp, **kwargs):
     57         if isinstance(path_or_fp, (str, pathlib.Path)):
---> 58             fp = open(path_or_fp, "rb")
     59             inst = cls(fp, **kwargs)
     60             inst.close = fp.close

FileNotFoundError: [Errno 2] No such file or directory: 'output.pdf'

Any help is appreciated. Thanks

Alex Alex · Accepted Answer

If the file name contains spaces, then you need to enclose the name in quotation marks.

ocrmypdf "Performance Evaluations.pdf" output.pdf

or

ocrmypdf 'Performance Evaluations.pdf' output.pdf

No output for OCRmyPDF

Answers (1)

Related Questions