Reputation: 409
I'm using OCRmyPDF to extract text from scanned PDF files. I use the code from this Colab notebook for that purpose. The only difference is that instead of downloading the PDF file from an online URL, I use a PDF file stored on my local machine (I replaced {invoice_pdf} with {file_name}). Everything looks fine up to the point where I run:
os.system(f'ocrmypdf {file_name} output.pdf')
Instead of 0, I get 512. On the next line, when I run !ocrmypdf Performance Evaluations.pdf output.pdf, I get an "unrecognized arguments" error message, which reads:
usage: ocrmypdf [-h] [-l LANGUAGE] [--image-dpi DPI]
[--output-type {pdfa,pdf,pdfa-1,pdfa-2}] [--sidecar [FILE]]
[--version] [-j N] [-q] [-v [VERBOSE]] [--title TITLE]
[--author AUTHOR] [--subject SUBJECT] [--keywords KEYWORDS]
[-r] [--remove-background] [-d] [-c] [-i] [--oversample DPI]
[-f] [-s] [--skip-big MPixels] [--max-image-mpixels MPixels]
[--tesseract-config CFG] [--tesseract-pagesegmode PSM]
[--tesseract-oem MODE]
[--pdf-renderer {auto,tesseract,hocr,sandwich}]
[--tesseract-timeout SECONDS]
[--rotate-pages-threshold CONFIDENCE]
[--pdfa-image-compression {auto,jpeg,lossless}]
[--user-words FILE] [--user-patterns FILE] [--skip-repair]
[-k] [-g] [--flowchart FLOWCHART]
input_pdf_or_image output_pdf
ocrmypdf: error: unrecognized arguments: output.pdf
Finally, running the following lines:
with pdfplumber.open('output.pdf') as pdf:
    page = pdf.pages[0]
    text = page.extract_text(x_tolerance=2)
    print(text)
returns
FileNotFoundError Traceback (most recent call last)
<ipython-input-19-8274f7005856> in <module>()
----> 1 with pdfplumber.open('output.pdf') as pdf:
2 page = pdf.pages[0]
3 text = page.extract_text(x_tolerance=2)
4 print(text)
/usr/local/lib/python3.6/dist-packages/pdfplumber/pdf.py in open(cls, path_or_fp, **kwargs)
56 def open(cls, path_or_fp, **kwargs):
57 if isinstance(path_or_fp, (str, pathlib.Path)):
---> 58 fp = open(path_or_fp, "rb")
59 inst = cls(fp, **kwargs)
60 inst.close = fp.close
FileNotFoundError: [Errno 2] No such file or directory: 'output.pdf'
Any help is appreciated. Thanks
Upvotes: 1
Views: 3458
Reputation: 2018
If the file name contains spaces, then you need to enclose the name in quotation marks.
ocrmypdf "Performance Evaluations.pdf" output.pdf
or
ocrmypdf 'Performance Evaluations.pdf' output.pdf
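The same quoting issue applies to the os.system call in the question, because the f-string drops the file name into a shell command unquoted. Below is a minimal sketch of two ways to handle that from Python, assuming ocrmypdf is on the PATH; the helper names build_ocr_command and run_ocr are made up for illustration and are not part of OCRmyPDF. Note also that on Linux os.system returns a wait status rather than the exit code, so 512 most likely corresponds to exit code 2 (512 >> 8), which is what a command-line usage error normally produces.

```python
import shlex
import subprocess


def build_ocr_command(input_pdf, output_pdf="output.pdf"):
    """Build the argument list (hypothetical helper).

    Each file name is its own list entry, so a space inside the
    name cannot split it into two arguments.
    """
    return ["ocrmypdf", input_pdf, output_pdf]


def run_ocr(input_pdf, output_pdf="output.pdf"):
    """Run ocrmypdf without a shell and return its exit code (hypothetical helper)."""
    result = subprocess.run(build_ocr_command(input_pdf, output_pdf))
    return result.returncode


file_name = "Performance Evaluations.pdf"

# Alternative if you want to keep os.system: quote the name for the shell.
shell_command = f"ocrmypdf {shlex.quote(file_name)} output.pdf"
print(shell_command)
```

Calling run_ocr(file_name) should then return 0 on success, and output.pdf will only exist for pdfplumber to open once that call has succeeded.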
Upvotes: 1