How to decode backslash-escaped strings in Python?

Question

I have a CSV file (see here) that contains metadata from posts on a public Facebook page. I need to decode all content such as \xc3\xa9 and \xf0\x9f\x91\xa9\xf0\x9f\x8f\xbb\xe2\x80\x8d\xf0\x9f\x92\xbc

The "post message" metadata field is:

"b'Bom dia, genteee! Me disseram que esse emoji \xc3\xa9 a minha cara: \xf0\x9f\x91\xa9\xf0\x9f\x8f\xbb\xe2\x80\x8d\xf0\x9f\x92\xbc O que voc\xc3\xaas acham?'"

and its type is str.

I need to convert it to:

Bom dia, genteee! Me disseram que esse emoji é a minha cara: 👩🏻‍💼 O que vocês acham?

How do I do this? I need to convert the whole CSV.

edit 1: I tried

My_string = post_message.split("b'")[1].split("'")[0]
My_string.encode().decode('unicode_escape')

but the result is different from what I expected:

Bom dia, genteee! Me disseram que esse emoji Ã© a minha cara: ð©ð»âð¼ O que vocÃªs acham?

Solution:

As @Ben pointed out, my data is a str object that contains the text representation of bytes, not a bytes object. So I used @ShadowRanger's solution (see his answer here). I did:

My_string = post_message[2:-1]  # remove "b'" from the beginning and "'" from the end
My_string = My_string.encode('utf-8').decode('unicode_escape').encode('latin-1').decode('utf-8')

The result:

Bom dia, genteee! Me disseram que esse emoji é a minha cara: 👩🏻‍💼 O que vocês acham?
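
An alternative to slicing and chaining encode/decode calls, offered only as a sketch: since each value looks like a Python bytes literal, ast.literal_eval can parse it back into bytes, which can then be decoded as UTF-8. The helper names decode_bytes_repr and convert_csv, and the column name post_message, are hypothetical and not taken from the actual file. The sketch assumes the bytes are valid UTF-8, and values that are not bytes literals are passed through unchanged.

```python
import ast
import csv
import io


def decode_bytes_repr(text):
    """Turn text like "b'...'" back into the string those bytes encode."""
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text  # not a Python literal: leave the cell unchanged
    if isinstance(value, bytes):
        return value.decode("utf-8")  # assumption: the bytes are UTF-8
    return text  # some other literal (e.g. a number): leave unchanged


def convert_csv(src, dst, column):
    """Copy CSV rows from src to dst, decoding one column along the way."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        row[column] = decode_bytes_repr(row[column])
        writer.writerow(row)


# The message from the question, written as a raw string so the
# backslashes stay literal, as they are in the CSV.
sample = (
    r"b'Bom dia, genteee! Me disseram que esse emoji \xc3\xa9 a minha cara: "
    r"\xf0\x9f\x91\xa9\xf0\x9f\x8f\xbb\xe2\x80\x8d\xf0\x9f\x92\xbc "
    r"O que voc\xc3\xaas acham?'"
)
decoded = decode_bytes_repr(sample)
print(decoded)

# In-memory stand-in for the real file (hypothetical column names).
src = io.StringIO("id,post_message\n1,b'voc\\xc3\\xaas'\n")
dst = io.StringIO()
convert_csv(src, dst, "post_message")
print(dst.getvalue())
```

With real files, open both with newline="" as the csv module documentation recommends, and write the output with encoding="utf-8" so the emoji survive.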

Ben · Accepted Answer

I notice that the string you posted looks like "b'...'", with double quotes around a single quoted string with b prefixed. That looks like a string containing the text representation of a bytestring, as opposed to a bytestring being printed as text.

For example:

>>> text = 'föő'
>>> text
'föő'
>>> bytestring = text.encode()
>>> bytestring
b'f\xc3\xb6\xc5\x91'
>>> str(bytestring)
"b'f\\xc3\\xb6\\xc5\\x91'"

It suggests you had a bytestring at some point and called str on it (or something similar) to turn it into a text string. That gives you the text representation of the bytestring, not the text that the bytestring is the encoding of.

However, if that theory were entirely correct, you would have doubled backslashes, as you can see in my example above. So it doesn't entirely fit, if the data is exactly as you showed in the OP.

Either way, it looks like some code had bytes at some point and converted them to text incorrectly. I would strongly recommend finding where that happens and fixing it there, rather than trying to correct the data after the fact.
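
To illustrate what such a fix might look like, here is a minimal sketch of the difference between calling str on bytes and decoding them. The variable names are made up for the example, and it assumes the bytes are UTF-8; the code that actually produced the CSV is not shown in the question, so this is only a guess at the kind of change needed.

```python
# Bytes as they might arrive from a download or an API response.
raw = "vocês".encode("utf-8")

# Calling str() on bytes gives the text representation of the bytes
# object, including the b'' wrapper and the backslash escapes.
wrong = str(raw)

# Decoding gives the text that the bytes are the encoding of.
right = raw.decode("utf-8")

# str() with an explicit encoding also decodes rather than taking the repr.
also_right = str(raw, encoding="utf-8")

print(wrong)
print(right)
```

If the text is decoded like this before it is written, the CSV contains the readable message and nothing needs to be repaired afterwards.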
