Character encoding error when reading a PDF

Question

I need to read this PDF.

I am using the following code:

from PyPDF2 import PdfFileReader

f = open('myfile.pdf', 'rb')
reader = PdfFileReader(f)
content = reader.getPage(0).extractText()
f.close()
content = ' '.join(content.replace('\xa0', ' ').strip().split())

print(content)

However, the encoding is incorrect. It prints:

Resultado da Prova de Sele“‰o do...

But I expected

Resultado da Prova de Seleção do...

How to solve it?

I'm using Python 3

Michelle Welcks · Accepted Answer

The PyPDF2 extractText method returns Unicode, so you may just need to encode it explicitly. For example, encode the Unicode string as UTF-8:

# -*- coding: utf-8 -*-
correct = u'Resultado da Prova de Seleção do...'
print(correct.encode(encoding='utf-8'))

You're on Python 3, so you have Unicode under the hood, and Python 3 defaults to UTF-8. But I wonder if you need to specify a different encoding based on your locale.

# Show the locale aliases Python knows about (not necessarily installed on this system)
import locale
from pprint import pprint
pprint(locale.locale_alias)
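
To see which encoding actually applies on your machine, a minimal sketch like the following may be more direct than reading the alias table. It only reports the values; whether they explain the problem is a separate question.

```python
import locale
import sys

# Encoding Python prefers for text data under the current locale settings.
preferred = locale.getpreferredencoding(False)

# Encoding that print() uses when writing to this terminal or pipe.
stdout_encoding = sys.stdout.encoding

print("Preferred locale encoding:", preferred)
print("stdout encoding:", stdout_encoding)
```

If both report UTF-8 and the output is still wrong, the terminal is probably not the cause and the extracted string itself is worth inspecting.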

If that's not the quick fix, since you're getting Unicode back from PyPDF, you could take a look at the code points for those two characters. It's possible that PyPDF wasn't able to determine the correct encoding and gave you the wrong characters.

For example, a quick and dirty comparison of the good and bad strings you posted:

# -*- coding: utf-8 -*-
# Python 3.4
incorrect = 'Resultado da Prova de Sele“‰o do'
correct = 'Resultado da Prova de Seleção do...'

print("Incorrect String")
print("CHAR{}UNI".format(' ' * 20))
print("-" * 50)
for char in incorrect:
    print(
        '{}{}{}'.format(
            char.encode(encoding='utf-8'),
            ' ' * 20,  # Hack; Byte objects don't have __format__
            ord(char)
        )
    )

print("\n" * 2)

print("Correct String")
print("CHAR{}UNI".format(' ' * 20))
print("-" * 50)
for char in correct:
    print(
        '{}{}{}'.format(
            char.encode(encoding='utf-8'),
            ' ' * 20,  # Hack; Byte objects don't have __format__
            ord(char)
        )
    )

Relevant Output:

b'\xe2\x80\x9c' 8220
b'\xe2\x80\xb0' 8240

b'\xc3\xa7' 231
b'\xc3\xa3' 227
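
The same comparison can be narrowed to the non-ASCII characters only, with their Unicode names attached. This is a small sketch using the standard unicodedata module; the helper name describe_non_ascii is made up for illustration.

```python
import unicodedata


def describe_non_ascii(text):
    """Return (char, code point, Unicode name) for each non-ASCII character."""
    return [
        (char, ord(char), unicodedata.name(char, 'UNKNOWN'))
        for char in text
        if ord(char) > 127
    ]


incorrect = 'Resultado da Prova de Sele“‰o do'
correct = 'Resultado da Prova de Seleção do...'

for label, text in (('incorrect', incorrect), ('correct', correct)):
    for char, code_point, name in describe_non_ascii(text):
        print('{:<10} {!r:<6} {:>5}  {}'.format(label, char, code_point, name))
```

The names make it obvious that the extracted string contains punctuation (a left double quotation mark and a per mille sign) where accented letters were expected.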

If you're not getting code point 231 for the ç (>>> hex(231) gives '0xe7'), then you're getting bad data back from PyPDF.
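
If the data really is wrong, one possible workaround is to map the bad characters back by hand. The sketch below rests on an assumption I have not verified against your file: that the PDF stored the text as Mac OS Roman bytes (0x8D for ç, 0x8B for ã) and that those bytes were decoded with PDFDocEncoding, where the same byte values stand for “ and ‰. The table and the repair function are hypothetical helpers, not part of PyPDF2.

```python
# Assumption: the extractor decoded Mac OS Roman bytes as PDFDocEncoding.
# This hand-built table covers only the two characters seen in the question;
# it maps each wrongly decoded character back to its presumed original byte.
PDFDOC_CHAR_TO_BYTE = {
    '\u201c': 0x8D,  # left double quotation mark
    '\u2030': 0x8B,  # per mille sign
}


def repair(text):
    """Re-decode the known mis-decoded characters as Mac OS Roman."""
    fixed = []
    for char in text:
        byte = PDFDOC_CHAR_TO_BYTE.get(char)
        if byte is None:
            fixed.append(char)  # leave everything else untouched
        else:
            fixed.append(bytes([byte]).decode('mac_roman'))
    return ''.join(fixed)


print(repair('Resultado da Prova de Sele“‰o do...'))
```

Note that this also rewrites any genuine “ or ‰ in the text, so it only makes sense if the document is not expected to contain those characters, and the table would need more entries for other accented letters.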
