shan

Reputation: 583

Removing \xf characters

I am trying to remove all

\xf0\x9f\x93\xa2, \xf0\x9f\x95\x91\n\, \xe2\x80\xa6,\xe2\x80\x99t 

type characters from the strings below in Python:

    Text
  _____________________________________________________
"b'Hello! \xf0\x9f\x93\xa2 End Climate Silence is looking for volunteers! \n\n1-2 hours per week. \xf0\x9f\x95\x91\n\nExperience doing digital research\xe2\x80\xa6

"b'I doubt if climate emergency 8s real, I think people will look ba\xe2\x80\xa6 '

"b'No, thankfully it doesn\xe2\x80\x99t. Can\xe2\x80\x99t see how cheap to overtourism in the alan alps can h\xe2\x80\xa6"

"b'Climate Change Poses a WidelllThreat to National Security "

"b""This doesn't feel like targeted propaganda at all. I mean states\xe2\x80\xa6"

"b'berates climate change activist who confronted her in airport\xc2\xa0 

The above content is a column in a pandas DataFrame.

I have tried

string.encode('ascii', errors='ignore')

and regex, but without luck. Any suggestions would be helpful.

Upvotes: 1

Views: 680

Answers (3)

Mann Jain

Reputation: 21

These are hexadecimal escape sequences that appear after encoding. Every occurrence has the form \xAB, where A and B are each a hexadecimal digit (0-9, a-f, A-F). Try matching them with a regex pattern such as \\x[0123456789abcdefABCDEF][0123456789abcdefABCDEF]
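A minimal sketch of that regex approach, assuming the backslashes are literal characters in the string (as they are when a bytes repr has been stored as text). The names `HEX_ESCAPE` and `strip_hex_escapes` are made up for illustration. Note that this deletes the sequences rather than decoding them, so an apostrophe encoded as \xe2\x80\x99 is lost as well.

```python
import re

# Matches a literal backslash, an "x", and exactly two hex digits.
HEX_ESCAPE = re.compile(r"\\x[0-9a-fA-F]{2}")


def strip_hex_escapes(text: str) -> str:
    """Remove every literal \\xNN sequence from text (hypothetical helper)."""
    return HEX_ESCAPE.sub("", text)


# Raw string, so the backslashes stay literal, as in the question's data.
sample = r"b'No, thankfully it doesn\xe2\x80\x99t. Can\xe2\x80\x99t see'"
cleaned = strip_hex_escapes(sample)
print(cleaned)
```

The shorter character class `[0-9a-fA-F]{2}` is equivalent to the spelled-out pattern above.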

Upvotes: 1

Wiktor Stribiżew

Reputation: 627334

The data in your question look like bytestring representations that have been stored as Unicode strings. To recover the real text you need to decode those bytestrings, but first you have to encode the strings back into bytes.

So, in your case, you can use

x = "b""This doesn't feel like targeted propaganda at all. I mean states\xe2\x80\xa6"
x = x.encode('latin1').decode('unicode-escape').encode('latin1').decode('utf8')
print(x)
# => bThis doesn't feel like targeted propaganda at all. I mean states…

See this Python demo.
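The same chain can be wrapped in a function, which is a sketch rather than a definitive implementation. It assumes each value is a str holding literal \xNN escapes and only characters that fit in Latin-1; the helper names `decode_escaped` and `to_ascii` are hypothetical. The second helper covers the original goal of dropping the non-ASCII characters once they have been decoded.

```python
def decode_escaped(text: str) -> str:
    # Step 1: interpret literal backslash escapes (\xe2, \n, ...) as characters.
    raw = text.encode("latin1").decode("unicode-escape")
    # Step 2: treat those characters as bytes again and decode them as UTF-8.
    return raw.encode("latin1").decode("utf8")


def to_ascii(text: str) -> str:
    # Drop everything outside ASCII (emoji, ellipsis, curly quotes).
    return text.encode("ascii", errors="ignore").decode("ascii")


# Raw string, so the backslashes are literal, as in the stored data.
sample = r"I mean states\xe2\x80\xa6"
decoded = decode_escaped(sample)
print(decoded)
print(to_ascii(decoded))
```

With pandas, something like `df["Text"].apply(decode_escaped)` should apply it to the whole column, assuming the column is named `Text` as in the question.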

Upvotes: 1

Bendik Knapstad

Reputation: 1458

Try decoding the bytes:

text = b'Hello! \xf0\x9f\x93\xa2 End Climate Silence is looking for volunteers! \n\n1-2 hours per week. \xf0\x9f\x95\x91\n\nExperience doing digital research\xe2\x80\xa6'.decode("utf8")
print(text)
>> Hello! 📢 End Climate Silence is looking for volunteers! 

1-2 hours per week. 🕑

Experience doing digital research…
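If the column holds strings that merely look like bytes literals (for example the text `b'...'`), they have to become real bytes objects before `.decode` is available. One possible sketch uses `ast.literal_eval` from the standard library; this is a different technique from the one above, and `repr_to_text` is a hypothetical helper name. It assumes each cell is a complete, well-formed literal, so truncated cells with a missing closing quote would raise an error.

```python
import ast


def repr_to_text(cell: str) -> str:
    # Safely evaluate the literal; "b'...'" becomes a bytes object.
    value = ast.literal_eval(cell)
    if isinstance(value, bytes):
        return value.decode("utf8")
    # Already a plain string literal: return it unchanged.
    return str(value)


# Raw string, so the cell contains literal backslashes as in the stored data.
cell = r"b'1-2 hours per week. \xf0\x9f\x95\x91'"
text = repr_to_text(cell)
print(text)
```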

Upvotes: 0
