jackiegirl89

Reputation: 87

How to remove strings starting with  and containing special characters pandas

I have a DataFrame with a column containing text. The data is read from, and saved back to, a CSV file and contains strings such as:

 Supporter🇨🇮
 🇮🇪🇪🇺
 üìû061 300149 üíª[email protected]

Is it possible to remove these strings from the textual data? If so what is the best way to do this?

I have tried:

 df['text'] = df['text'].replace(r'(?<![@\w])(^\W+)', '', regex=True)

But unfortunately it doesn't remove the strings.

Thanks!

Upvotes: 0

Views: 759

Answers (2)

perl

Reputation: 9941

For example, given the following DataFrame:

                Supporter
0                🇨🇮
1                     foo
2        🇮🇪🇪🇺
3          üìû061 300149
4                     bar
5  üíª[email protected]

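we can use str.contains to drop every row that contains a non-ASCII character (str.match only tests the start of each string, and the range \u0080-\uFFFF misses emoji such as flags, which lie above U+FFFF):

df.loc[~df['Supporter'].str.contains(r'[^\x00-\x7F]')]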

Output:

  Supporter
1       foo
4       bar
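
A self-contained sketch of the row filter, in case it helps to run it end to end. The sample values are illustrative stand-ins rather than the asker's real data, and the column name Supporter is taken from the example above. It uses str.contains with a negated ASCII class, so a non-ASCII character anywhere in the string is caught, including emoji above U+FFFF.

```python
import pandas as pd

# Illustrative sample data (stand-ins, not the asker's real file).
df = pd.DataFrame(
    {'Supporter': ['\U0001F1E8\U0001F1EE', 'foo', 'caf\u00e9 bar', 'bar']}
)

# True for any row holding at least one character outside the ASCII range.
has_non_ascii = df['Supporter'].str.contains(r'[^\x00-\x7F]', regex=True)

# Keep only the pure-ASCII rows.
ascii_only = df.loc[~has_non_ascii]
print(ascii_only)
```

Note that 'café bar' is dropped as well: the non-ASCII character sits in the middle of the string, which a start-anchored test would not see.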

If you only want to strip the special characters and keep every row, use str.replace with regex=True (since pandas 2.0 the pattern is treated as a literal string by default):

df['Supporter'] = df['Supporter'].str.replace(r'[^\x00-\x7F]', '', regex=True)

print(df)

Output:

           Supporter
0                   
1                foo
2                   
3         061 300149
4                bar
5  [email protected]

Note: If there are any NA values in the DataFrame, drop them before running either snippet:

df = df.dropna()
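
If dropping rows is not an option, one possible alternative is to let the string methods handle missing values themselves. This is only a sketch with made-up sample data: na=False is a documented parameter of str.contains, and str.replace passes NaN through unchanged.

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing value.
df = pd.DataFrame({'Supporter': ['foo', np.nan, '\u00fc\u00ec\u00fb061 300149']})

# na=False treats missing values as "no match", so the mask stays boolean
# and can be negated safely.
mask = df['Supporter'].str.contains(r'[^\x00-\x7F]', regex=True, na=False)
kept = df.loc[~mask]  # keeps 'foo' and the NaN row

# str.replace leaves NaN untouched, so this step needs no dropna().
cleaned = df['Supporter'].str.replace(r'[^\x00-\x7F]', '', regex=True)
```

Whether the NaN rows should survive the filter depends on the data; pass na=True instead to drop them along with the non-ASCII rows.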

Upvotes: 1

rdas

Reputation: 21275

You can try the methods described here: Replace non-ASCII characters with a single space

Instead of replacing with a space, pass the empty string '' to get rid of the characters.
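
As a rough sketch of that idea, assuming a plain re.sub over each value is acceptable: strip_non_ascii is a hypothetical helper name, the sample rows are made up, and the column name text is taken from the question.

```python
import re

import pandas as pd


def strip_non_ascii(text):
    """Hypothetical helper: delete every run of non-ASCII characters."""
    return re.sub(r'[^\x00-\x7F]+', '', text)


# Made-up sample rows.
df = pd.DataFrame({'text': ['Supporter\U0001F1E8\U0001F1EE', 'plain']})

# Apply the helper to each value; this assumes the column has no NaN.
df['text'] = df['text'].map(strip_non_ascii)
print(df)
```

Swapping the '' for ' ' gives the single-space behaviour from the linked question.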

Upvotes: 0
