Reputation: 446
I have a pandas dataframe with hex values as given below:
df['col1']
<0020>
<0938>
<002E>
<092B>
<092B>
<0916>
<0915>
<0915>
<096F>
<096C>
I want to convert the hex values to their corresponding unicode literals. So, I try to do the following:
df['col1'] = df['col1'].apply(lambda x : '\u' + str(x)[1:-1])
I hoped this would convert it to my required unicode literal, but I get the following error:
File "<ipython-input-22-891ccdd39e79>", line 1
df['col1'].apply(lambda x : '\u' + str(x)[1:-1])
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape
In Python 3, when we try the following, we get:
>>> string1 = '\u03b4'
>>> print(string1)
δ
So I tried adding \u to my given string. I also tried adding \\u, but that shows up as two backslashes. Adding an r before \u also ends up showing two backslashes instead of the unicode literal. I also tried decoding as unicode, but that didn't work either.
Also, it'd be great if someone could explain the concept of raw strings, \u, etc.
Upvotes: 1
Views: 856
Reputation: 3168
In order to convert all your codes into unicode characters, here is a one-liner:
import codecs
import pandas as pd
(
# create a series with the prefix "\u" to add to the existing column
pd.Series([r'\u'] * len(df['col1']))
# str.strip deletes the "<" and ">" from your column
# str.cat concatenates the prefix created before to the existing column
.str.cat(df['col1'].str.strip('<>'))
# then apply a conversion from the raw string to a normal string
.apply(codecs.decode, args=['unicode_escape'])
)
In the previous code, you have to create the prefix as a raw string. If not, the parser expects a valid \uXXXX escape right after the backslash (which is the error you get in your code).
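As a self-contained sketch of the pipeline above (with a hypothetical sample dataframe standing in for your data):

```python
import codecs
import pandas as pd

# hypothetical sample data mirroring the question's column
df = pd.DataFrame({'col1': ['<0938>', '<002E>', '<092B>']})

result = (
    # raw-string prefix: the backslash is kept as a literal character
    pd.Series([r'\u'] * len(df['col1']))
    # strip the angle brackets and concatenate -> '\u0938', '\u002E', ...
    .str.cat(df['col1'].str.strip('<>'))
    # interpret each '\uXXXX' text as an escape sequence
    .apply(codecs.decode, args=['unicode_escape'])
)
print(result.tolist())  # ['स', '.', 'फ']
```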
Edit: I add the explanation from Serge Ballesta's post:
\uxxxx escape sequences are intended for the Python parser. When they are found in a string literal, they are automatically replaced with the unicode character having that code point. You can use the codecs module and the unicode_escape encoding to decode a string that contains actual \u characters (meaning that you escape the backslash, as in "\\uxxxx"), but as you directly have a hex representation of the code point, it is simpler to use the chr function directly.
His solution is more elegant than mine.
Upvotes: 1
Reputation: 149085
Oops, literals are for... literal values! As soon as you have variables, you should use conversion functions like int and chr.
Here you have a column containing strings. For each cell in the column, you want to remove the first and last characters, process what remains as a hex value, and get the unicode character with that code point. In Python, that just reads:
df['col1'].apply(lambda x: chr(int(x[1:-1], 16)))
And with your values, it gives:
0
1 स
2 .
3 फ
4 फ
5 ख
6 क
7 क
8 ९
9 ६
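The same one-liner, runnable on a small hypothetical sample (values taken from the question):

```python
import pandas as pd

df = pd.DataFrame({'col1': ['<0938>', '<002E>', '<096F>']})

# strip the angle brackets, parse the remaining hex digits,
# and map the resulting code point to its character with chr
converted = df['col1'].apply(lambda x: chr(int(x[1:-1], 16)))
print(converted.tolist())  # ['स', '.', '९']
```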
Now for the reason for your error. \uxxxx escape sequences are intended for the Python parser. When they are found in a string literal, they are automatically replaced with the unicode character having that code point. You can use the codecs module and the unicode_escape encoding to decode a string that contains actual \u characters (meaning that you escape the backslash, as in "\\uxxxx"), but as you directly have a hex representation of the code point, it is simpler to use the chr function directly.
And in your initial code, when you write '\u', the parser sees the initial part of an encoded character and tries to decode it immediately... but cannot find the hex code point after it, so it throws the exception. If you really want to go that way, you have to double the backslash (\\) to escape it and store it as-is in the string, and then use codecs.decode(..., encoding='unicode_escape') to decode the string, as shown in @ndclt's answer. But I do not advise you to do so.
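A minimal sketch of that escape-the-backslash route, for comparison (using the '0938' code from the question):

```python
import codecs

# the escaped backslash keeps '\u' as two literal characters,
# so s is the six-character string  \u0938
s = '\\u' + '0938'

# unicode_escape then interprets that text as an escape sequence
print(codecs.decode(s, 'unicode_escape'))  # स
```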
References can be found in the Standard Python Library documentation: the chr function and the codecs module.
Upvotes: 2