Daniel Stephens

Reputation: 3209

Escape hex character in string

I executed a function in Python on Windows that returned this string:

import subprocess

p = subprocess.Popen(args=["devenv.exe", "project.sln"], ...)
stdout, stderr = p.communicate()
print(stdout) # b'unzul\x84ssig'

This is supposed to be the word unzulässig. Which codec do I need to use to decode it back to that word? Neither string_escape nor utf8 (of course) worked. Can anyone help me?

Upvotes: 0

Views: 185

Answers (2)

chepner

Reputation: 531135

Looks like you may want code page 858:

>>> "unzulässig".encode('858')
b'unzul\x84ssig'

So

>>> res = b'unzul\x84ssig'
>>> res.decode('858')
'unzulässig'

As @deceze pointed out in a comment, IBM437 and IBM850 are also possibilities.

>>> res.decode('ibm437')
'unzulässig'
>>> res.decode('ibm850')
'unzulässig'

There is a lot of overlap between the various character sets, and based on this small sample all we can do is suggest ones that are known to map 'ä' to b'\x84'. For example, my original suggestion of 858 came from noticing at https://en.wikipedia.org/wiki/Windows_code_page that 858 is a DOS code page for Western European languages (with euro sign). Many single-byte encodings are identical for most code points (even ignoring 0-127, which very often share the same ASCII roots) but differ at select values.
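To illustrate the overlap, here is a minimal sketch that tries a handful of codecs against the sample bytes and reports which ones produce the expected word. The candidate list and the helper name `matching_codecs` are my own choices for illustration, not anything the question specifies.

```python
# Sample bytes from the question and the word they should decode to.
raw = b"unzul\x84ssig"
expected = "unzulässig"

# Hypothetical shortlist of codecs to try; extend it as needed.
candidates = ["cp437", "cp850", "cp852", "cp858", "cp1252", "latin-1", "utf-8"]


def matching_codecs(data, target, codecs):
    """Return the codecs that decode `data` to exactly `target`."""
    matches = []
    for name in codecs:
        try:
            if data.decode(name) == target:
                matches.append(name)
        except UnicodeDecodeError:
            # This codec cannot decode the bytes at all (e.g. utf-8 here).
            pass
    return matches


print(matching_codecs(raw, expected, candidates))
# ['cp437', 'cp850', 'cp852', 'cp858']
```

Several DOS code pages match, so a single sample like this narrows the field but cannot identify the encoding on its own.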

Upvotes: 4

jsbueno

Reputation: 110271

res = function().decode("cp852")
print(res) # unzulässig

How do you know it is cp852? You have to know that from the documentation of your function, or of its data source. There is no such thing as 'text' if you are getting bytes as input: you have to know which encoding was used to represent the desired text as those bytes.
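A minimal sketch of decoding a child process's output with an explicitly chosen encoding. The helper name `run_and_decode` is hypothetical, and the child process here is just a stand-in Python one-liner that emits the bytes from the question; the "cp850" argument is an assumption that you would replace with whatever encoding your actual tool is documented to use.

```python
import subprocess
import sys


def run_and_decode(args, encoding):
    """Run a command and decode its stdout with a caller-supplied encoding."""
    completed = subprocess.run(args, stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE, check=True)
    # errors="replace" keeps the call from raising on unexpected bytes.
    return completed.stdout.decode(encoding, errors="replace")


# Stand-in child process that writes the raw bytes b'unzul\x84ssig' to stdout.
child = [sys.executable, "-c",
         "import sys; sys.stdout.buffer.write(b'unzul\\x84ssig')"]

text = run_and_decode(child, "cp850")  # assumed encoding, see note above
print(text)
```

Passing the encoding in as a parameter keeps the assumption visible at the call site instead of burying it inside the helper.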

I suggest reading https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

(In particular, under Windows, the cmd window uses an old DOS encoding for compatibility with 1980s-era code. A Python interpreter started from the cmd shell will probably reflect this encoding in the sys.stdout.encoding attribute.)

Upvotes: 3
