Reputation: 2274
I'm having way more trouble with this than I'd like to admit. I've checked numerous posts already with no luck. I'm trying to convert a bytes object like this:
b = b'%PDF-1.5\r%\xe2\xe3\xcf\xd3\r\n'
into a string variable.
I have tried the following already,
import codecs
codecs.decode(b, 'hex')
# Error: decoding with 'hex' codec failed (Error: Non-hexadecimal digit found)
b.decode('hex')
# LookupError: 'hex' is not a text encoding; use codecs.decode() to handle arbitrary codecs
b.unhexlify(_)
#AttributeError: 'bytes' object has no attribute 'unhexlify'
str(b)
# just gives me the same bytes object with str type
b.decode('utf-8')
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
Can anyone tell me what I'm doing wrong here?
Upvotes: 0
Views: 1821
Reputation: 126777
What you have there is a PDF file; while partially ASCII-text based, PDF files are not plain text. You can find a way to decode even the magic bytes in the header (iso8859-1 should do, since it maps every byte value to a character), but as soon as you hit a deflate-compressed stream you'll have high-entropy sequences that use all 256 byte values and cannot be decoded meaningfully with any codec.
In other words: there's no way to meaningfully decode the whole byte content of a PDF file to a Unicode string, as it's not a straight representation of Unicode code points of any kind. It's like trying to decode a JPEG file to a Unicode string: it makes no sense and it is not possible.
If you want to extract text from a PDF file you have to actually parse and decode its structure, which is not trivial at all.
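To illustrate the point about the header versus compressed streams, here is a minimal sketch, assuming Python 3. The stream content is made up for the example and only stands in for a real deflate-compressed PDF stream. It shows that iso8859-1 decodes any byte sequence without error, but on compressed data the result is just a byte-for-character mapping, not meaningful text.

```python
import zlib

header = b'%PDF-1.5\r%\xe2\xe3\xcf\xd3\r\n'

# ISO 8859-1 maps each byte value 0-255 to the code point with the same
# number, so decoding with it never raises.
header_text = header.decode('iso8859-1')
print(repr(header_text))

# Hypothetical stand-in for a deflate-compressed stream inside a PDF.
stream = zlib.compress(b'BT /F1 12 Tf (Hello) Tj ET ' * 20)

# UTF-8 rejects the compressed bytes.
try:
    stream.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print('utf-8 decodes stream:', utf8_ok)

# iso8859-1 "succeeds", but the string is only a reversible re-labelling
# of the bytes, not readable text.
as_text = stream.decode('iso8859-1')
round_trips = as_text.encode('iso8859-1') == stream
print('round trips:', round_trips)
```

If all you need is to carry the bytes around in a str and get them back unchanged, this round trip is enough; it does not extract any text from the PDF.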
Upvotes: 1
Reputation: 444
Actually, in Python 2, b already is a string. You can confirm this by checking its type and verifying that it prints all your special characters:
>>> b = b'%PDF-1.5\r%\xe2\xe3\xcf\xd3\r\n'
>>> type(b)
<type 'str'>
>>> print(b)
%ÔѤË1.5
>>>
If you have a real bytes object, you convert from bytes to string using .decode(encoding). The bad thing is that you need to know your encoding to do this.
I went through a couple of encodings from this site by trial and error: https://docs.python.org/2.4/lib/standard-encodings.html. iso8859_15 didn't produce any errors, but I cannot guarantee it's the right one. Here is a snippet:
b.decode('iso8859_15')
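As a rough check on the trial-and-error approach, the sketch below (assuming Python 3) decodes the same bytes with a few single-byte codecs. The choice of codecs to compare is mine, not from the original answer. The absence of an error only shows that the codec defines those byte values, not that it is the encoding the data was written in.

```python
b = b'%PDF-1.5\r%\xe2\xe3\xcf\xd3\r\n'

# Several single-byte codecs accept these bytes without complaint.
decoded = {name: b.decode(name) for name in ('iso8859_15', 'iso8859_1', 'cp1252')}
for name, text in decoded.items():
    print(name, repr(text))

# The codecs still disagree on other byte values, e.g. 0xA4.
currency_15 = b'\xa4'.decode('iso8859_15')  # euro sign
currency_1 = b'\xa4'.decode('iso8859_1')    # generic currency sign
print(repr(currency_15), repr(currency_1))

# If some loss is acceptable, errors='replace' avoids the exception
# by substituting U+FFFD for undecodable bytes.
lossy = b.decode('utf-8', errors='replace')
print(repr(lossy))
```

For these particular header bytes the three codecs happen to agree, which is why the guess looks fine here; on other data they can diverge silently.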
Upvotes: 0