Reputation: 87
I have a dataframe with a column containing text. The data is read from and saved back to a CSV file and contains strings such as:
Supporter🇨🇮
🇮🇪🇪🇺
📞061 300149 💻[email protected]
Is it possible to remove these strings from the textual data? If so what is the best way to do this?
I have tried:
df['text'] = df['text'].replace(r'(?<![@\w])(^\W+)', '', regex=True)
But unfortunately it doesn't remove the strings.
Thanks!
Upvotes: 0
Views: 759
Reputation: 9941
For example, for the following DataFrame:
  Supporter
0 🇨🇮
1 foo
2 🇮🇪🇪🇺
3 📞061 300149
4 bar
5 💻[email protected]
we can use str.contains
to drop any row containing special characters (unlike str.match, it is not anchored to the start of the string, and extending the range to \U0010FFFF also catches emoji outside the Basic Multilingual Plane):
df.loc[~df['Supporter'].str.contains('[\u0080-\U0010FFFF]')]
Output:
Supporter
1 foo
4 bar
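For reference, here is a self-contained sketch of that filtering step. The DataFrame is reconstructed from the example above, with emoji standing in for the "special" characters:

```python
import pandas as pd

# Reconstructed example data from the question
df = pd.DataFrame({'Supporter': ['\U0001F1E8\U0001F1EE', 'foo',
                                 '\U0001F1EE\U0001F1EA\U0001F1EA\U0001F1FA',
                                 '\U0001F4DE061 300149', 'bar',
                                 '\U0001F4BB[email protected]']})

# Keep only rows with no character outside the ASCII range;
# \U0010FFFF extends the class past the BMP so emoji are matched too.
clean = df.loc[~df['Supporter'].str.contains('[\u0080-\U0010FFFF]')]
print(clean['Supporter'].tolist())  # ['foo', 'bar']
```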
Also, if you want to just remove the special characters while keeping the actual records (note that recent pandas versions require regex=True, since str.replace no longer treats the pattern as a regex by default):
df['Supporter'] = df['Supporter'].str.replace('[\u0080-\U0010FFFF]', '', regex=True)
print(df)
Output:
  Supporter
0
1 foo
2
3 061 300149
4 bar
5 [email protected]
Note: if there are any NA
values in the DataFrame, they should be dropped first, since the .str methods propagate them:
df = df.dropna()
Upvotes: 1
Reputation: 21275
You can try the methods described here: Replace non-ASCII characters with a single space. Instead of replacing with a space, pass the empty string '' to get rid of the characters.
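Applied to the question's data, that approach might look like the following sketch: encoding to ASCII with errors='ignore' silently drops every non-ASCII character, and decoding restores a plain string.

```python
import pandas as pd

df = pd.DataFrame({'text': ['Supporter\U0001F1E8\U0001F1EE', 'foo']})

# Encode to ASCII, dropping non-ASCII characters instead of
# replacing them with a space, then decode back to str.
df['text'] = (df['text'].str.encode('ascii', errors='ignore')
                        .str.decode('ascii'))
print(df['text'].tolist())  # ['Supporter', 'foo']
```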
Upvotes: 0