kramer65

Reputation: 53873

How to unlock a "secured" (read-protected) PDF in Python?

In Python I'm using pdfminer to read the text from a pdf with the code below this message. I now get an error message saying:

File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfpage.py", line 124, in get_pages
    raise PDFTextExtractionNotAllowed('Text extraction is not allowed: %r' % fp)
PDFTextExtractionNotAllowed: Text extraction is not allowed: <cStringIO.StringO object at 0x7f79137a1ab0>

When I open this PDF with Acrobat Pro, it turns out it is secured (or "read protected"). From this link, however, I read that there's a multitude of services which can disable this read-protection easily (for example pdfunlock.com). When diving into the source of pdfminer, I see that the error above is generated on these lines.

if check_extractable and not doc.is_extractable:
    raise PDFTextExtractionNotAllowed('Text extraction is not allowed: %r' % fp)

Since there's a multitude of services which can disable this read-protection within a second, I presume it is really easy to do. It seems that .is_extractable is a simple attribute of the doc, but I don't think it is as simple as changing .is_extractable to True.

Does anybody know how I can disable the read protection on a pdf using Python? All tips are welcome!

================================================

Below you will find the code with which I currently extract the text from PDFs that are not read-protected.

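# Imports for Python 2 with the original pdfminer package.
from cStringIO import StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

def getTextFromPDF(rawFile):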
    resourceManager = PDFResourceManager(caching=True)
    outfp = StringIO()
    device = TextConverter(resourceManager, outfp, codec='utf-8', laparams=LAParams(), imagewriter=None)
    interpreter = PDFPageInterpreter(resourceManager, device)

    fileData = StringIO()
    fileData.write(rawFile)
    for page in PDFPage.get_pages(fileData, set(), maxpages=0, caching=True, check_extractable=True):
        interpreter.process_page(page)
    fileData.close()
    device.close()

    result = outfp.getvalue()

    outfp.close()
    return result

Upvotes: 34

Views: 70122

Answers (10)

Christopher

Reputation: 627

pikepdf didn't work for me. I found a solution using PyPDF2 that decrypts all PDF files in the current working directory.

import os
from PyPDF2 import PdfReader, PdfWriter

def remove_encryption_from_pdf(input_path, output_path):
    """Write an unencrypted copy; return False if the input was not encrypted."""
    with open(input_path, "rb") as file:
        reader = PdfReader(file)
        if not reader.is_encrypted:
            return False
        # Copy every page into a fresh writer, which has no encryption set.
        writer = PdfWriter()
        for page in reader.pages:
            writer.add_page(page)
        with open(output_path, "wb") as output_pdf:
            writer.write(output_pdf)
    return True

if __name__ == "__main__":
    directory_path = os.getcwd()  # get current directory path

    for filename in os.listdir(directory_path):
        # Skip non-PDFs and the copies written by an earlier run.
        if filename.endswith('.pdf') and not filename.startswith("decrypted_"):
            input_path = os.path.join(directory_path, filename)
            output_path = os.path.join(directory_path, "decrypted_" + filename)
            print(f"Processing {filename}")  # print the file name
            try:
                if remove_encryption_from_pdf(input_path, output_path):
                    print(f"Encryption removed from {filename}")
                else:
                    print(f"{filename} is not encrypted, skipped")
            except Exception as e:
                print(f"Failed to remove encryption from {filename}. Error: {e}")

Upvotes: 3

IanJ

Reputation: 735

Use pikepdf, which is based on qpdf. When you open a PDF and save it again, pikepdf writes the copy without encryption by default, so the result is extractable.

Code for Reference:

import pikepdf
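def remove_password_from_pdf(filename, password=""):
    # pikepdf expects the password as a string; "" means "no password".
    # save() writes the copy without encryption unless told otherwise.
    with pikepdf.open(filename, password=password) as pdf:
        pdf.save("pdf_file_with_no_password.pdf")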

if __name__ == "__main__":
    remove_password_from_pdf(filename="/path/to/file")

Upvotes: 63

Abhishek Divekar

Reputation: 1247

If you've forgotten the password to your PDF, below is a generic script which tries a LOT of password combinations on the same PDF. It uses pikepdf, but you can update the function check_password to use something else.

Usage example:

I used this when I had forgotten a password on a bank PDF. I knew that my bank always encrypts these kinds of PDFs with the same password structure:

  1. Total length = 8
  2. First 4 characters = uppercase letters.
  3. Last 4 characters = digits.

I call the script as follows:

check_passwords(
    pdf_file_path='/Users/my_name/Downloads/XXXXXXXX.pdf',
    combination=[
        ALPHABET_UPPERCASE,
        ALPHABET_UPPERCASE,
        ALPHABET_UPPERCASE,
        ALPHABET_UPPERCASE,
        NUMBER,
        NUMBER,
        NUMBER,
        NUMBER,
    ]
)
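
Before running it, it helps to know how big that search space is. With one choice per position, the count is the product of the set sizes, here 26^4 × 10^4 = 4,569,760,000. A small sketch of that arithmetic; the rate of 1,000 passwords per second is a made-up figure for illustration, not a measurement:

```python
import math

ALPHABET_UPPERCASE = tuple("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
NUMBER = tuple("0123456789")

# Same structure as the call above: 4 uppercase letters, then 4 digits.
combination = [ALPHABET_UPPERCASE] * 4 + [NUMBER] * 4

# One independent choice per position, so multiply the set sizes.
num_passwords = math.prod(len(choices) for choices in combination)
print(f"{num_passwords:,} candidate passwords")

# Hypothetical throughput, for illustration only; measure your own.
assumed_rate = 1_000  # passwords per second
days = num_passwords / assumed_rate / 86_400
print(f"roughly {days:.0f} days at {assumed_rate:,} passwords/sec (worst case)")
```

If the estimate comes out at weeks or months, it is worth narrowing the structure (for example, fixing characters you remember) before starting.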

Password-checking script:

(Requires Python 3.8+, with the numpy and pikepdf libraries)

from typing import *
from itertools import product
import time, pikepdf, math, numpy as np
from pikepdf import PasswordError

ALPHABET_UPPERCASE: Sequence[str] = tuple('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
ALPHABET_LOWERCASE: Sequence[str] = tuple('abcdefghijklmnopqrstuvwxyz')
NUMBER: Sequence[str] = tuple('0123456789')

def as_list(l):
    if isinstance(l, (list, tuple, set, np.ndarray)):
        l = list(l)
    else:
        l = [l]
    return l

def human_readable_numbers(n, decimals: int = 0):
    n = round(n)
    if n < 1000:
        return str(n)
    names = ['', 'thousand', 'million', 'billion', 'trillion', 'quadrillion']
    n = float(n)
    idx = max(0,min(len(names)-1,
                        int(math.floor(0 if n == 0 else math.log10(abs(n))/3))))

    return f'{n/10**(3*idx):.{decimals}f} {names[idx]}'

def check_password(pdf_file_path: str, password: str) -> bool:
    ## You can modify this function to use something other than pikepdf.
    ## This function should return True on success, and False on password failure.
    try:
        # Close the file again right away; we only care whether it opens.
        with pikepdf.open(pdf_file_path, password=password):
            return True
    except PasswordError:
        return False


def check_passwords(pdf_file_path, combination, log_freq: int = int(1e4)):
    combination = [tuple(as_list(c)) for c in combination]
    print(f'Trying all combinations:')
    for i, c in enumerate(combination):
        print(f"{i}) {c}")
    # np.product was removed in NumPy 2.0; math.prod (Python 3.8+) does the same job.
    num_passwords: int = math.prod(len(x) for x in combination)
    passwords = product(*combination)
    success: bool | str = False
    count: int = 0
    start: float = time.perf_counter()
    for password in passwords:
        password = ''.join(password)
        if check_password(pdf_file_path, password=password):
            success = password
            print(f'SUCCESS with password "{password}"')
            break
        count += 1
        if count % int(log_freq) == 0:
            now = time.perf_counter()
            print(f'Tried {human_readable_numbers(count)} ({100*count/num_passwords:.1f}%) of {human_readable_numbers(num_passwords)} passwords in {(now-start):.3f} seconds ({human_readable_numbers(count/(now-start))} passwords/sec). Latest password tried: "{password}"')
    end: float = time.perf_counter()
    msg: str = f'Tried {count} passwords in {1000*(end-start):.3f}ms ({count/(end-start):.3f} passwords/sec). '
    msg += f"Correct password: {success}" if success is not False else f"All {num_passwords} passwords failed."
    print(msg)

Comments

  1. Obviously, don't use this to break into PDFs which are not your own. I hold no responsibility over how you use this script or any consequences of using it.
  2. A lot of optimizations can be made.
    • Right now check_password uses pikepdf, which loads the file from disk for every "check". This is really slow; ideally it should run against an in-memory copy. I haven't figured out a way to do that, though.
    • You can probably speed this up a LOT by calling qpdf directly using C++, which is much better than Python for this kind of stuff.
    • I would avoid multi-processing here, since we're calling the same qpdf binary (which is normally a system-wide installation), which might become the bottleneck.
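
On the in-memory point: pikepdf.open also accepts a binary file-like object, so one way to avoid re-reading the file is to load its bytes once and hand each attempt a fresh io.BytesIO. A minimal sketch, assuming the disk read is a meaningful part of the cost (it may not be; the password check itself could dominate). make_in_memory_checker is a hypothetical helper name, not part of pikepdf or the script above:

```python
import io
from pathlib import Path


def make_in_memory_checker(pdf_file_path: str):
    """Return a check(password) function that reuses one in-memory copy of the PDF."""
    # pikepdf is third-party; importing it here means it is only needed on use.
    import pikepdf

    pdf_bytes = Path(pdf_file_path).read_bytes()  # read from disk exactly once

    def check(password: str) -> bool:
        try:
            # A fresh BytesIO per attempt, so every open starts at offset 0.
            with pikepdf.open(io.BytesIO(pdf_bytes), password=password):
                return True
        except pikepdf.PasswordError:
            return False

    return check
```

The returned function could stand in for the check_password call inside the loop of check_passwords, for example check = make_in_memory_checker(pdf_file_path) before the loop and check(password) inside it.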

Upvotes: 0

Pieter

Reputation: 3447

Full disclosure: I am one of the maintainers of pdfminer.six, a community-maintained version of pdfminer for Python 3.

This issue was fixed in 2020 by disabling check_extractable by default. pdfminer.six now shows a warning instead of raising an error.

Similar question and answer here.

Upvotes: 2

Satish Dubey

Reputation: 97

I used the code below with pikepdf and was able to overwrite the original file.

import pikepdf

pdf = pikepdf.open('filepath', allow_overwriting_input=True)
pdf.save('filepath')

Upvotes: 8

komutohirowato

Reputation: 53

If you want to unlock all pdf files in a folder without renaming them, you may use this code:

import glob, os, pikepdf

p = os.getcwd()
for file in glob.glob('*.pdf'):
   file_path = os.path.join(p, file).replace('\\','/')
   init_pdf = pikepdf.open(file_path)
   new_pdf = pikepdf.new()
   new_pdf.pages.extend(init_pdf.pages)
   new_pdf.save(str(file))

By default, pikepdf refuses to save a PDF over the file it was opened from (unless it was opened with allow_overwriting_input=True). Instead, copy the pages into a newly created empty PDF and save that under the original name.

Upvotes: 0

Knoweldgeyog

Reputation: 99

I faced the same problem when parsing a secured PDF, and it was resolved by using the pikepdf library. It gave errors in my Jupyter notebook on Windows, but it worked smoothly on Ubuntu.

Upvotes: -1

AlfiyaFaisy

Reputation: 444

The 'check_extractable=True' argument is by design. Some PDFs explicitly disallow text extraction, and PDFMiner follows the directive. You can override it (by passing check_extractable=False), but do so at your own risk.
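
For illustration, here is a sketch of the extraction function from the question with the check turned off. It assumes Python 3 and pdfminer.six rather than the Python 2 pdfminer used in the question, and get_text_from_pdf is just an illustrative name. It only helps for files that open without a password; it does not decrypt anything.

```python
from io import BytesIO, StringIO


def get_text_from_pdf(raw_bytes: bytes) -> str:
    """Extract text while ignoring the 'extraction not allowed' flag."""
    # pdfminer.six is third-party; importing here means it is only needed on use.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    resource_manager = PDFResourceManager(caching=True)
    out = StringIO()
    device = TextConverter(resource_manager, out, laparams=LAParams())
    interpreter = PDFPageInterpreter(resource_manager, device)
    try:
        # check_extractable=False skips the PDFTextExtractionNotAllowed check.
        for page in PDFPage.get_pages(BytesIO(raw_bytes), check_extractable=False):
            interpreter.process_page(page)
    finally:
        device.close()
    return out.getvalue()
```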

Upvotes: 2

jtlz2

Reputation: 8407

In my case there was no password, but simply setting check_extractable=False circumvented the PDFTextExtractionNotAllowed exception for a problematic file (that opened fine in other viewers).

Upvotes: 3

Jaza

Reputation: 3226

As far as I know, in most cases the full content of the PDF is actually encrypted, using the password as the encryption key, and so simply setting .is_extractable to True isn't going to help you.

Per this thread:

Does a library exist to remove passwords from PDFs programmatically?

I would recommend removing the read-protection with a command-line tool such as qpdf (easily installable, e.g. on Ubuntu use apt-get install qpdf if you don't have it already):

qpdf --password=PASSWORD --decrypt SECURED.pdf UNSECURED.pdf

Then open the unlocked file with pdfminer and do your stuff.
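
If you want to drive this from Python, the command above can be wrapped with subprocess. A rough sketch, assuming qpdf is on the PATH; build_qpdf_command and decrypt_with_qpdf are illustrative names, and treating exit status 3 as success follows qpdf's documented exit codes (worth double-checking for your version):

```python
import subprocess
from typing import List


def build_qpdf_command(src: str, dst: str, password: str = "") -> List[str]:
    """Build the same qpdf invocation as the command line above."""
    return ["qpdf", f"--password={password}", "--decrypt", src, dst]


def decrypt_with_qpdf(src: str, dst: str, password: str = "") -> None:
    """Run qpdf and raise RuntimeError if it reports a failure."""
    # Passing a list (no shell) avoids quoting problems with odd passwords.
    result = subprocess.run(build_qpdf_command(src, dst, password))
    # qpdf documents exit status 3 as "succeeded with warnings".
    if result.returncode not in (0, 3):
        raise RuntimeError(f"qpdf failed with exit status {result.returncode}")
```

For example, decrypt_with_qpdf("SECURED.pdf", "UNSECURED.pdf", "PASSWORD") runs the command shown above, after which UNSECURED.pdf can be passed to pdfminer.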

For a pure-Python solution, you can try using PyPDF2 and its .decrypt() method, but it doesn't work with all types of encryption, so really, you're better off just using qpdf - see:

https://github.com/mstamy2/PyPDF2/issues/53

Upvotes: 30
