Reputation: 87
I have a dataframe with a column containing text. The data is read from and saved back to a CSV file and contains strings such as:
Supporter🇨🇮
🇮🇪🇪🇺
📞061 300149 💻[email protected]
Is it possible to remove these strings from the textual data? If so what is the best way to do this?
I have tried:
df['text'] = df['text'].replace(r'(?<![@\w])(^\W+)', '', regex=True)
But unfortunately it doesn't remove the strings.
Thanks!
Upvotes: 0
Views: 759
Reputation: 9941
For example, for the following DataFrame:
  Supporter
0 🇨🇮
1 foo
2 🇮🇪🇪🇺
3 📞061 300149
4 bar
5 💻[email protected]
we can use str.contains
to drop any row containing special characters (unlike str.match, it is not anchored to the start of the string, and extending the range to \U0010FFFF also catches emoji outside the Basic Multilingual Plane):
df.loc[~df['Supporter'].str.contains('[\u0080-\U0010FFFF]')]
Output:
Supporter
1 foo
4 bar
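For reference, here is a self-contained sketch of that filtering step. The DataFrame is reconstructed from the example above, with emoji standing in for the "special" characters:

```python
import pandas as pd

# Reconstructed example data from the question
df = pd.DataFrame({'Supporter': ['\U0001F1E8\U0001F1EE', 'foo',
                                 '\U0001F1EE\U0001F1EA\U0001F1EA\U0001F1FA',
                                 '\U0001F4DE061 300149', 'bar',
                                 '\U0001F4BB[email protected]']})

# Keep only rows with no character outside the ASCII range;
# \U0010FFFF extends the class past the BMP so emoji are matched too.
clean = df.loc[~df['Supporter'].str.contains('[\u0080-\U0010FFFF]')]
print(clean['Supporter'].tolist())  # ['foo', 'bar']
```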
Also, if you want to just remove the special characters while keeping the actual records (note that recent pandas versions require regex=True, since str.replace no longer treats the pattern as a regex by default):
df['Supporter'] = df['Supporter'].str.replace('[\u0080-\U0010FFFF]', '', regex=True)
print(df)
Output:
  Supporter
0
1 foo
2
3 061 300149
4 bar
5 [email protected]
Note: if there are any NA
values in the DataFrame, they should be dropped first, since the .str methods propagate them:
df = df.dropna()
Upvotes: 1
Reputation: 21275
You can try the methods described here: Replace non-ASCII characters with a single space. Instead of replacing with a space, pass the empty string '' to get rid of the characters.
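Applied to the question's data, that approach might look like the following sketch: encoding to ASCII with errors='ignore' silently drops every non-ASCII character, and decoding restores a plain string.

```python
import pandas as pd

df = pd.DataFrame({'text': ['Supporter\U0001F1E8\U0001F1EE', 'foo']})

# Encode to ASCII, dropping non-ASCII characters instead of
# replacing them with a space, then decode back to str.
df['text'] = (df['text'].str.encode('ascii', errors='ignore')
                        .str.decode('ascii'))
print(df['text'].tolist())  # ['Supporter', 'foo']
```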
Upvotes: 0