Alex F

Reputation: 2274

Convert a bytes object with hexadecimal characters to string?

I'm having way more trouble with this than I'd like to admit. I've checked numerous posts already with no luck. I'm trying to convert a bytes object like this:

b = b'%PDF-1.5\r%\xe2\xe3\xcf\xd3\r\n'

into a string variable.

I have already tried the following:

import codecs
codecs.decode(b, 'hex')
# Error: decoding with 'hex' codec failed (Error: Non-hexadecimal digit found)

b.decode('hex')
# LookupError: 'hex' is not a text encoding; use codecs.decode() to handle arbitrary codecs


b.unhexlify(_)
# AttributeError: 'bytes' object has no attribute 'unhexlify'


str(b)
# just gives me the same bytes object with str type


b.decode('utf-8')
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte

Can anyone tell me what I'm doing wrong here?

Upvotes: 0

Views: 1821

Answers (2)

Matteo Italia

Reputation: 126777

What you have there is the start of a PDF file. PDF files are partly ASCII text, but they are not plain text. You can find a codec that decodes even the magic bytes in the header (iso8859-1 should do), but as soon as you hit a deflate-compressed stream you'll have runs of high-entropy bytes that use all 256 possible values and cannot be decoded meaningfully with any codec.
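A minimal sketch of the header case, assuming Python 3 and the sample bytes from the question. ISO 8859-1 (latin-1) maps every byte value to the code point with the same number, so decoding never raises and round-trips exactly, but only the ASCII part of the result is meaningful text.

```python
# Sample header bytes copied from the question.
data = b'%PDF-1.5\r%\xe2\xe3\xcf\xd3\r\n'

# latin-1 maps byte value N to code point N, so this cannot raise.
text = data.decode('latin-1')

# The ASCII part of the header is readable.
print(text[:8])  # %PDF-1.5

# One character per byte, and encoding gives back the original bytes.
assert len(text) == len(data)
assert text.encode('latin-1') == data
```

The four bytes after the second `%` become accented Latin letters here. They are only a marker that the file is binary, not text anyone wrote.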

In other words: there's no way to meaningfully decode the whole byte content of a PDF file to a Unicode string, as it's not a straight representation of Unicode code points of any kind. It's like trying to decode a JPEG file to a Unicode string: it makes no sense and it is not possible.
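To illustrate the point, here is a rough sketch using `zlib` to produce a stand-in for a deflate-compressed PDF stream. The payload is invented for the example and is not taken from any real file.

```python
import zlib

# Hypothetical stand-in for a compressed PDF content stream.
stream = zlib.compress(b'BT /F1 12 Tf (Hello) Tj ET\n' * 20)

# Strict UTF-8 decoding rejects the compressed bytes.
try:
    stream.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)  # False

# latin-1 never raises, but the result is noise rather than text.
noise = stream.decode('latin-1')
print(len(noise) == len(stream))  # True

# The readable content only comes back by decompressing, not by decoding.
original = zlib.decompress(stream)
print(original[:26])  # b'BT /F1 12 Tf (Hello) Tj ET'
```

A codec that accepts every byte gives you a string, but not one that means anything.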

If you want to extract text from a PDF file you have to actually parse and decode its structure, which is not trivial at all.

Upvotes: 1

jedzej

Reputation: 444

In Python 2, b is already a string (str). You can confirm this by checking its type and seeing that it prints your special characters:

>>> b = b'%PDF-1.5\r%\xe2\xe3\xcf\xd3\r\n'
>>> type(b)
<type 'str'>
>>> print(b)
%ÔѤË1.5

>>>

If you have a real bytes object, you convert it to a string using .decode(encoding). The catch is that you need to know the encoding to do this.

I tried a few encodings from this list by trial and error: https://docs.python.org/2.4/lib/standard-encodings.html. iso8859_15 did not raise any errors, but I cannot guarantee it is the right one. Here is a snippet:

b.decode('iso8859_15')
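The trial-and-error approach could be sketched like this, assuming Python 3. The list of candidate codecs is an arbitrary choice for illustration, and a codec that decodes without error is not necessarily the right one.

```python
# Sample bytes copied from the question.
data = b'%PDF-1.5\r%\xe2\xe3\xcf\xd3\r\n'

# Arbitrary candidates picked for this example.
candidates = ['ascii', 'utf-8', 'cp1252', 'iso8859_15']

# Record which codecs decode the bytes without raising.
results = {}
for name in candidates:
    try:
        data.decode(name)
        results[name] = True
    except UnicodeDecodeError:
        results[name] = False

print(results)
# {'ascii': False, 'utf-8': False, 'cp1252': True, 'iso8859_15': True}
```

Two codecs succeed here, which shows the problem: success only means every byte has some mapping in that codec, not that the bytes were ever text in that encoding.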

Upvotes: 0
