Reputation: 583
I am trying to remove all escape sequences such as
\xf0\x9f\x93\xa2, \xf0\x9f\x95\x91, \xe2\x80\xa6, \xe2\x80\x99
from the strings below in Python.
Text
_____________________________________________________
"b'Hello! \xf0\x9f\x93\xa2 End Climate Silence is looking for volunteers! \n\n1-2 hours per week. \xf0\x9f\x95\x91\n\nExperience doing digital research\xe2\x80\xa6
"b'I doubt if climate emergency 8s real, I think people will look ba\xe2\x80\xa6 '
"b'No, thankfully it doesn\xe2\x80\x99t. Can\xe2\x80\x99t see how cheap to overtourism in the alan alps can h\xe2\x80\xa6"
"b'Climate Change Poses a WidelllThreat to National Security "
"b""This doesn't feel like targeted propaganda at all. I mean states\xe2\x80\xa6"
"b'berates climate change activist who confronted her in airport\xc2\xa0
The above content is a column in a pandas DataFrame.
I have tried
string.encode('ascii', errors='ignore')
and regex, but without luck. Any suggestions would be helpful.
Upvotes: 1
Views: 680
Reputation: 21
These are hexadecimal escape sequences that appear after encoding.
All occurrences have the form \xAB, where A and B are each a hexadecimal digit (0-9, a-f, A-F). Try a regex with the pattern \\x[0-9a-fA-F][0-9a-fA-F]
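A minimal sketch of that regex approach, assuming the cells hold the escapes as literal text (a backslash, an x, and two hex digits) rather than as real bytes. The names HEX_ESCAPE and strip_hex_escapes are made up for illustration.

```python
import re

# Matches a literal backslash, the letter x, and two hex digits (as text).
HEX_ESCAPE = re.compile(r"\\x[0-9a-fA-F]{2}")

def strip_hex_escapes(text: str) -> str:
    """Delete every literal \\xHH sequence from text (hypothetical helper)."""
    return HEX_ESCAPE.sub("", text)

# Raw string, so the backslashes stay literal, as they would in the column.
sample = r"b'No, thankfully it doesn\xe2\x80\x99t. Can\xe2\x80\x99t see'"
cleaned = strip_hex_escapes(sample)
print(cleaned)
```

Note that this deletes the characters instead of decoding them, so the apostrophe in "doesn't" is lost along with the escape.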
Upvotes: 1
Reputation: 627334
The data in your question looks like bytestring representations that were re-encoded as Unicode strings. To decode those bytestrings, you first need to encode the strings back to bytes.
So, in your case, you can use
x = "b""This doesn't feel like targeted propaganda at all. I mean states\xe2\x80\xa6"
x = x.encode('latin1').decode('unicode-escape').encode('latin1').decode('utf8')
print(x)
# => bThis doesn't feel like targeted propaganda at all. I mean states…
See this Python demo.
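The same chain can be wrapped in a function so it can be reused per cell. This is only a sketch: fix_escaped_utf8 is a hypothetical name, and it assumes each cell contains the escapes as literal backslash text and no characters outside Latin-1 (otherwise the first encode raises UnicodeEncodeError).

```python
def fix_escaped_utf8(text: str) -> str:
    """Turn literal \\xHH escape text into the UTF-8 characters it encodes."""
    # Step 1: interpret the literal escapes, giving one character per byte value.
    unescaped = text.encode("latin1").decode("unicode-escape")
    # Step 2: map those characters back to raw bytes and decode them as UTF-8.
    return unescaped.encode("latin1").decode("utf8")

# Raw strings keep the backslashes literal, as assumed for the column data.
samples = [
    r"I mean states\xe2\x80\xa6",
    r"No, thankfully it doesn\xe2\x80\x99t.",
]
fixed = [fix_escaped_utf8(s) for s in samples]
print(fixed)
```

With pandas, the function could be passed to Series.apply, for example df['Text'].apply(fix_escaped_utf8), where the column name 'Text' is a guess based on the question.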
Upvotes: 1
Reputation: 1458
Try decoding the bytes.
text=b'Hello! \xf0\x9f\x93\xa2 End Climate Silence is looking for volunteers! \n\n1-2 hours per week. \xf0\x9f\x95\x91\n\nExperience doing digital research\xe2\x80\xa6'.decode("utf8")
print(text)
>> Hello! 📢 End Climate Silence is looking for volunteers! 

1-2 hours per week. 🕑

Experience doing digital research…
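If the column holds the printed form of a bytes object as a string (text starting with b' and ending with a quote), one option is to parse it back into bytes first and then decode. The sketch below assumes every cell is a complete, well-formed bytes literal, which the truncated samples in the question may not be. clean_cell is a hypothetical helper, and the final ASCII step is there only because the question asks to remove the characters.

```python
import ast

def clean_cell(cell: str) -> str:
    """Parse a "b'...'" string into bytes, decode it, and drop non-ASCII."""
    raw = ast.literal_eval(cell)  # "b'...'" text -> real bytes object
    if not isinstance(raw, bytes):
        raise ValueError("cell is not a bytes literal")
    # errors="ignore" skips multi-byte sequences cut off by truncation.
    text = raw.decode("utf8", errors="ignore")
    # Remove emoji, ellipses and other non-ASCII characters.
    return text.encode("ascii", errors="ignore").decode("ascii")

# Raw string: the cell contains literal backslashes, not real bytes.
cell = r"b'1-2 hours per week. \xf0\x9f\x95\x91\n\nExperience doing digital research\xe2\x80\xa6'"
result = clean_cell(cell)
print(result)
```

Dropping the last line of clean_cell keeps the emoji and punctuation as readable Unicode instead of removing them.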
Upvotes: 0