Reputation: 446
I have a pandas dataframe with hex values as given below:
df['col1']
<0020>
<0938>
<002E>
<092B>
<092B>
<0916>
<0915>
<0915>
<096F>
<096C>
I want to convert the hex values to their corresponding unicode literals. So, I try to do the following:
df['col1'] = df['col1'].apply(lambda x : '\u' + str(x)[1:-1])
I hoped this would convert it to my required unicode literal, but I get the following error:
File "<ipython-input-22-891ccdd39e79>", line 1
df['col1'].apply(lambda x : '\u' + str(x)[1:-1])
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape
In Python 3, when we try the following, we get:
>>> string1 = '\u03b4'
>>> print(string1)
δ
So I tried adding \u to my given string. I also tried adding \\u, but that shows up as two backslashes. Adding an r before \u also ends up showing two backslashes instead of the unicode literal. I also tried decoding as unicode, but that didn't work either.
Also, it'd be great if someone could explain the concept of raw strings, \u, etc.
Upvotes: 1
Views: 856
Reputation: 3168
In order to convert all your codes into unicode characters, here is a one-liner:
import codecs
import pandas as pd
(
# create a series with the prefix "\u" to add to the existing column
pd.Series([r'\u'] * len(df['col1']))
# str.strip deletes the "<" and ">" from your column
# str.cat concatenates the prefix created before to the existing column
.str.cat(df['col1'].str.strip('<>'))
# then apply a conversion from the raw string to a normal string
.apply(codecs.decode, args=['unicode_escape'])
)
In the previous code, you have to create the prefix as a raw string. If not, the parser expects a valid \uXXXX escape right after the backslash (which is the error you get in your code).
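As a self-contained sketch of the pipeline above (with a hypothetical sample dataframe standing in for your data):

```python
import codecs
import pandas as pd

# hypothetical sample data mirroring the question's column
df = pd.DataFrame({'col1': ['<0938>', '<002E>', '<092B>']})

result = (
    # raw-string prefix: the backslash is kept as a literal character
    pd.Series([r'\u'] * len(df['col1']))
    # strip the angle brackets and concatenate -> '\u0938', '\u002E', ...
    .str.cat(df['col1'].str.strip('<>'))
    # interpret each '\uXXXX' text as an escape sequence
    .apply(codecs.decode, args=['unicode_escape'])
)
print(result.tolist())  # ['स', '.', 'फ']
```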
Edit: I add the explanation from Serge Ballesta's post:
\uxxxx escape sequences are intended for the Python parser. When they are found in a string literal, they are automatically replaced with the unicode character having that code point. You can use the codecs module and the unicode_escape encoding to decode a string that contains actual \u characters (meaning that you escape the backslash, as in "\\uxxxx"), but as you directly have a hex representation of the code point, it is simpler to use the chr function directly.
His solution is more elegant than mine.
Upvotes: 1
Reputation: 149085
Oops, literals are for... literal values! As soon as you have variables, you should use conversion functions like int and chr.
Here you have a column containing strings. For each cell in the column, you want to remove the first and last characters, process what remains as a hex value, and get the unicode character with that code point. In Python, that just reads:
df['col1'].apply(lambda x: chr(int(x[1:-1], 16)))
And with your values, it gives:
0
1 स
2 .
3 फ
4 फ
5 ख
6 क
7 क
8 ९
9 ६
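The same one-liner, runnable on a small hypothetical sample (values taken from the question):

```python
import pandas as pd

df = pd.DataFrame({'col1': ['<0938>', '<002E>', '<096F>']})

# strip the angle brackets, parse the remaining hex digits,
# and map the resulting code point to its character with chr
converted = df['col1'].apply(lambda x: chr(int(x[1:-1], 16)))
print(converted.tolist())  # ['स', '.', '९']
```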
Now for the reason for your error. \uxxxx escape sequences are intended for the Python parser. When they are found in a string literal, they are automatically replaced with the unicode character having that code point. You can use the codecs module and the unicode_escape encoding to decode a string that contains actual \u characters (meaning that you escape the backslash, as in "\\uxxxx"), but as you directly have a hex representation of the code point, it is simpler to use the chr function directly.
And in your initial code, when you write '\u', the parser sees the initial part of an encoded character and tries to decode it immediately... but cannot find the hex code point after it, so it throws the exception. If you really want to go that way, you have to double the backslash (\\) to escape it and store it as-is in the string, and then use codecs.decode(..., encoding='unicode_escape') to decode the string, as shown in @ndclt's answer. But I do not advise you to do so.
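A minimal sketch of that escape-the-backslash route, for comparison (using the '0938' code from the question):

```python
import codecs

# the escaped backslash keeps '\u' as two literal characters,
# so s is the six-character string  \u0938
s = '\\u' + '0938'

# unicode_escape then interprets that text as an escape sequence
print(codecs.decode(s, 'unicode_escape'))  # स
```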
References can be found in the Standard Python Library documentation: the chr function and the codecs module.
Upvotes: 2