Reputation: 41

PyPDF2<=1.19 has issues with PDF encoding

I am trying to encrypt PDF files under Python 3.3.2 using PyPDF2.

The code is very simple:

password = 'password'
# password = password.encode('utf-8')
PDFout.encrypt(user_pwd=password,owner_pwd=password)

However, I get one of the following errors, depending on whether the encode line is on or off:

on: TypeError: slice indices must be integers or None or have an __index__ method

off: TypeError: Can't convert 'bytes' object to str implicitly

Would you know by any chance how to resolve that problem?

Thanks and Regards Peter

Upvotes: 1

Answers (2)

Matthew Stamy

Reputation: 1164

Try installing the most recent version of PyPDF2 - it now fully supports Python 3!

It seems that "some" support was added in 1.16, but it didn't cover all features. Now the library should be fully compatible with Python 3.

Upvotes: 1

Christian Abbott

Reputation: 6967

It appears to me that the current version of PyPDF2 (1.19 as of this writing) has some bugs concerning compatibility with Python 3, and that is what is causing both error messages. The change log on GitHub for PyPDF2 indicates that Python 3 support was added in version 1.16, which was released only 3 1/2 months ago, so it is possible these bugs either haven't been reported or haven't been fixed yet. GitHub also shows that there is a branch of this project specifically for Python 3.3 support, which is not currently merged back into the main branch.

Both errors occur in the pdf.py file of the PyPDF2 module. Here is what is happening:

The PyPDF2 module creates some extra bytes as padding and concatenates them with your password. If the Python version is less than 3, the padding is created as a string literal. If the version is 3 or higher, the padding is encoded using the 'latin-1' encoding. In Python 3, this means the padding is a bytes object, and concatenating that with a string object (your password) produces the TypeError you saw. Under Python 2, the concatenation would work because both objects would be the same type.
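To see the type mismatch in isolation, here is a minimal sketch for Python 3. The padding value below is a short made-up stand-in, not the actual padding used in pdf.py, and the variable names are only for illustration:

```python
# Hypothetical stand-in for the module's padding (not the real value from pdf.py).
# Encoding a str with 'latin-1' yields a bytes object in Python 3.
padding = "\x28\xbf\x4e\x5e".encode("latin-1")

password = "password"  # a str, as in the question with the encode line off

error = None
try:
    padded = password + padding  # str + bytes is not allowed in Python 3
except TypeError as exc:
    error = exc  # the exact message wording varies between Python 3 releases

# With the password encoded first, both operands are bytes and this works.
padded = password.encode("utf-8") + padding
```

This only reproduces the first error; encoding the password gets past it, which is why the second error shows up next.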

When you encode your password using "utf-8", you resolve that problem since both the password and padding are bytes objects in that case. However, you end up running into a second bug later in the module. The pdf.py file creates and uses a variable "keylen" like this:

keylen = 128 / 8
... # later on in the code...
key = md5_hash[:keylen]

The division operator's behavior changed in Python 3; the groundwork for the change (the "//" operator and an opt-in via __future__) was introduced back in Python 2.2. In brief, when both operands are ints, "/" means floor division in Python 2 and returns an int, but it means true division in Python 3 and returns a float. Therefore, "keylen" would be 16 in Python 2, but 16.0 in Python 3. Floats, unlike ints, can't be used as slice indices, so Python 3 throws the TypeError you saw when md5_hash[:keylen] is evaluated. Python 2 would run this without error, since keylen would be an int.

You could resolve this second problem by altering the module's source code to use the "//" operator (which means floor division and returns an int in both Python 2 and 3):

keylen = 128 // 8
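The difference between the two operators can be checked outside the library with a small sketch under Python 3. The hashed input is a placeholder, and the names keylen_true, keylen_floor and float_slice_failed are mine, not PyPDF2's:

```python
import hashlib

# Placeholder input; the real code hashes data derived from the password.
md5_hash = hashlib.md5(b"example input").digest()  # an MD5 digest is 16 bytes

keylen_true = 128 / 8    # true division: a float in Python 3
keylen_floor = 128 // 8  # floor division: an int in both Python 2 and 3

try:
    md5_hash[:keylen_true]  # a float is not a valid slice index
    float_slice_failed = False
except TypeError:
    float_slice_failed = True

key = md5_hash[:keylen_floor]  # works, since the index is an int
```

Under Python 3, float_slice_failed ends up True and key holds the first 16 bytes of the digest.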

However, you would then run into a third bug later in the code, also related to Python 3 compatibility. I won't belabor the point by describing it. The short answer to your question then, as far as I see it, is to either use Python 2, or patch the various code compatibility problems, or use a different PDF library for Python which has better support for Python 3 (if one exists which meets your particular requirements).

Upvotes: 1
