Reputation: 53873
In Python I'm using pdfminer to read the text from a pdf with the code below this message. I now get an error message saying:
File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfpage.py", line 124, in get_pages
raise PDFTextExtractionNotAllowed('Text extraction is not allowed: %r' % fp)
PDFTextExtractionNotAllowed: Text extraction is not allowed: <cStringIO.StringO object at 0x7f79137a1
ab0>
When I open this pdf with Acrobat Pro it turns out it is secured (or "read protected"). From this link however, I read that there's a multitude of services which can disable this read-protection easily (for example pdfunlock.com. When diving into the source of pdfminer, I see that the error above is generated on these lines.
if check_extractable and not doc.is_extractable:
raise PDFTextExtractionNotAllowed('Text extraction is not allowed: %r' % fp)
Since there's a multitude of services which can disable this read-protection within a second, I presume it is really easy to do. It seems that .is_extractable
is a simple attribute of the doc
, but I don't think it is as simple as changing .is_extractable
to True..
Does anybody know how I can disable the read protection on a pdf using Python? All tips are welcome!
================================================
Below you will find the code with which I currently extract the text from non-read protected.
def getTextFromPDF(rawFile):
resourceManager = PDFResourceManager(caching=True)
outfp = StringIO()
device = TextConverter(resourceManager, outfp, codec='utf-8', laparams=LAParams(), imagewriter=None)
interpreter = PDFPageInterpreter(resourceManager, device)
fileData = StringIO()
fileData.write(rawFile)
for page in PDFPage.get_pages(fileData, set(), maxpages=0, caching=True, check_extractable=True):
interpreter.process_page(page)
fileData.close()
device.close()
result = outfp.getvalue()
outfp.close()
return result
Upvotes: 34
Views: 70122
Reputation: 627
pikepdf didn't work for me. I found a solution using PyPDF2 to unencrypt all files in current working directory.
import os
from PyPDF2 import PdfReader, PdfWriter
def remove_encryption_from_pdf(input_path, output_path):
with open(input_path, "rb") as file:
reader = PdfReader(file)
if reader.is_encrypted:
writer = PdfWriter()
for page in reader.pages:
writer.add_page(page)
with open(output_path, "wb") as output_pdf:
writer.write(output_pdf)
if __name__ == "__main__":
directory_path = os.getcwd() # get current directory path
for filename in os.listdir(directory_path):
if filename.endswith('.pdf'):
input_path = os.path.join(directory_path, filename)
output_path = os.path.join(directory_path, "decrypted_" + filename)
print(f"Processing {filename}") # print the file name
try:
remove_encryption_from_pdf(input_path, output_path)
print(f"Encryption removed from {filename}")
except Exception as e:
print(f"Failed to remove encryption from {filename}. Error: {e}")
Upvotes: 3
Reputation: 735
Refer, pikepdf, which is based on qpdf
. It automatically converts pdfs to be extractable.
Code for Reference:
import pikepdf
def remove_password_from_pdf(filename, password=None):
pdf = pikepdf.open(filename, password=password)
pdf.save("pdf_file_with_no_password.pdf")
if __name__ == "__main__":
remove_password_from_pdf(filename="/path/to/file")
Upvotes: 63
Reputation: 1247
If you've forgotten the password to your PDF, below is a generic script which tries a LOT of password combinations on the same PDF. It uses pikepdf
, but you can update the function check_password
to use something else.
I used this when I had forgotten a password on a bank PDF. I knew that my bank always encrypts these kind of PDFs with the same password-structure:
I call script as follows:
check_passwords(
pdf_file_path='/Users/my_name/Downloads/XXXXXXXX.pdf',
combination=[
ALPHABET_UPPERCASE,
ALPHABET_UPPERCASE,
ALPHABET_UPPERCASE,
ALPHABET_UPPERCASE,
NUMBER,
NUMBER,
NUMBER,
NUMBER,
]
)
(Requires Python3.8, with libraries numpy
and pikepdf
)
from typing import *
from itertools import product
import time, pikepdf, math, numpy as np
from pikepdf import PasswordError
ALPHABET_UPPERCASE: Sequence[str] = tuple('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
ALPHABET_LOWERCASE: Sequence[str] = tuple('abcdefghijklmnopqrstuvwxyz')
NUMBER: Sequence[str] = tuple('0123456789')
def as_list(l):
if isinstance(l, (list, tuple, set, np.ndarray)):
l = list(l)
else:
l = [l]
return l
def human_readable_numbers(n, decimals: int = 0):
n = round(n)
if n < 1000:
return str(n)
names = ['', 'thousand', 'million', 'billion', 'trillion', 'quadrillion']
n = float(n)
idx = max(0,min(len(names)-1,
int(math.floor(0 if n == 0 else math.log10(abs(n))/3))))
return f'{n/10**(3*idx):.{decimals}f} {names[idx]}'
def check_password(pdf_file_path: str, password: str) -> bool:
## You can modify this function to use something other than pike pdf.
## This function should throw return True on success, and False on password-failure.
try:
pikepdf.open(pdf_file_path, password=password)
return True
except PasswordError:
return False
def check_passwords(pdf_file_path, combination, log_freq: int = int(1e4)):
combination = [tuple(as_list(c)) for c in combination]
print(f'Trying all combinations:')
for i, c in enumerate(combination):
print(f"{i}) {c}")
num_passwords: int = np.product([len(x) for x in combination])
passwords = product(*combination)
success: bool | str = False
count: int = 0
start: float = time.perf_counter()
for password in passwords:
password = ''.join(password)
if check_password(pdf_file_path, password=password):
success = password
print(f'SUCCESS with password "{password}"')
break
count += 1
if count % int(log_freq) == 0:
now = time.perf_counter()
print(f'Tried {human_readable_numbers(count)} ({100*count/num_passwords:.1f}%) of {human_readable_numbers(num_passwords)} passwords in {(now-start):.3f} seconds ({human_readable_numbers(count/(now-start))} passwords/sec). Latest password tried: "{password}"')
end: float = time.perf_counter()
msg: str = f'Tried {count} passwords in {1000*(end-start):.3f}ms ({count/(end-start):.3f} passwords/sec). '
msg += f"Correct password: {success}" if success is not False else f"All {num_passwords} passwords failed."
print(msg)
check_password
uses pikepdf, which loads the file from disk for every "check". This is really slow, ideally it should run against an in-memory copy. I haven't figured out a way to do that, though.Upvotes: 0
Reputation: 3447
Full disclosure, I am one of the maintainers of pdfminer.six. It is a community-maintained version of pdfminer for python 3.
This issue was fixed in 2020 by disabling the check_extractable
by default. It now shows a warning instead of raising an error.
Similar question and answer here.
Upvotes: 2
Reputation: 97
I used below code using pikepdf and able to overwrite.
import pikepdf
pdf = pikepdf.open('filepath', allow_overwriting_input=True)
pdf.save('filepath')
Upvotes: 8
Reputation: 53
If you want to unlock all pdf files in a folder without renaming them, you may use this code:
import glob, os, pikepdf
p = os.getcwd()
for file in glob.glob('*.pdf'):
file_path = os.path.join(p, file).replace('\\','/')
init_pdf = pikepdf.open(file_path)
new_pdf = pikepdf.new()
new_pdf.pages.extend(init_pdf.pages)
new_pdf.save(str(file))
In pikepdf library it is impossible to overwrite the existing file by saving it with the same name. In contrast, you would like to copy the pages to the newly created empty pdf file, and save it.
Upvotes: 0
Reputation: 99
I too faced the same problem of parsing the secured pdf but it has got resolved using pikepdf library. I tried this library on my jupyter notebbok and on windows os but it gave errors but it worked smoothly on Ubuntu
Upvotes: -1
Reputation: 444
The 'check_extractable=True' argument is by design. Some PDFs explicitly disallow to extract text, and PDFMiner follows the directive. You can override it (giving check_extractable=False), but do it at your own risk.
Upvotes: 2
Reputation: 8407
In my case there was no password, but simply setting check_extractable=False
circumvented the PDFTextExtractionNotAllowed
exception for a problematic file (that opened fine in other viewers).
Upvotes: 3
Reputation: 3226
As far as I know, in most cases the full content of the PDF is actually encrypted, using the password as the encryption key, and so simply setting .is_extractable
to True
isn't going to help you.
Per this thread:
Does a library exist to remove passwords from PDFs programmatically?
I would recommend removing the read-protection with a command-line tool such as qpdf
(easily installable, e.g. on Ubuntu use apt-get install qpdf
if you don't have it already):
qpdf --password=PASSWORD --decrypt SECURED.pdf UNSECURED.pdf
Then open the unlocked file with pdfminer
and do your stuff.
For a pure-Python solution, you can try using PyPDF2
and its .decrypt()
method, but it doesn't work with all types of encryption, so really, you're better off just using qpdf
- see:
https://github.com/mstamy2/PyPDF2/issues/53
Upvotes: 30