Reputation: 866

How to delete substrings with specific characters in a pandas dataframe?

I have a pandas dataframe that looks like this:

COL

hi A/P_90890 how A/P_True A/P_/93290 are AP_wueiwo A/P_|iwoeu you A/P_?9028k ?
...
 Im  fine, what A/P_49 A/P_0.0309 about you?

The expected result should be:

COL

hi how are you?
...
Im fine, what about you?

How can I remove efficiently from a column and for the full pandas dataframe all the strings that have A/P_?

I tried with this regular expression:

A/P_(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+

However, I do not know if there's a more simpler or robust way of removing all those substrings from my dataframe. How can I remove all the strings that have A/P_ at the beginning?

UPDATE

I tried:

df_sess['COL'] = df_sess['COL'].str.replace(r'A/P(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '')

And it works, however I would like to know if there's a more robust way of doing this. Possibily with a regular expression.

Upvotes: 3

Answers (3)

Ben.T

Reputation: 29635

one way could be to use \S* matching all non withespaces after A/P_ and also add \s to remove the whitespace after the string to remove, such as:

df_sess['COL'] = df_sess['col'].str.replace(r'A/P_\S*\s', '')

In you input, it seems there is an typo error (or at least I think so), so with this input:

df_sess = pd.DataFrame({'col':['hi A/P_90890 how A/P_True A/P_/93290 are A/P_wueiwo A/P_|iwoeu you A/P_?9028k ?',
                              'Im fine, what A/P_49 A/P_0.0309 about you?']})
print (df_sess['col'].str.replace(r'A/P_\S*\s', ''))
0            hi how are you ?
1    Im fine, what about you?
Name: col, dtype: object

you get the expected output

Upvotes: 3

RomanPerekhrest

Reputation: 92854

Because of pandas 0.23.0 bug in replace() function (https://github.com/pandas-dev/pandas/issues/21159) when trying to replace by regex pattern the error occurs:

df.COL.str.replace(regex_pat, '', regex=True)
...
--->
TypeError: Type aliases cannot be used with isinstance().

I would suggest to use pandas.Series.apply function with precompiled regex pattern:

In [1170]: df4 = pd.DataFrame({'COL': ['hi A/P_90890 how A/P_True A/P_/93290 are AP_wueiwo A/P_|iwoeu you A/P_?9028k ?', 'Im  fine, what A/P_49 A/P_0.0309 about you?']})

In [1171]: pat = re.compile(r'\s*A/?P_[^\s]*')

In [1172]: df4['COL']= df4.COL.apply(lambda x: pat.sub('', x))

In [1173]: df4
Out[1173]: 
                         COL
0           hi how are you ?
1  Im  fine, what about you?

Upvotes: 2

Anton vBR

Reputation: 18906

How about:

(df['COL'].replace('A[/P|P][^ ]+', '', regex=True)
          .replace('\s+',' ', regex=True))

Full example:

import pandas as pd

df = pd.DataFrame({
    'COL': 
    ["hi A/P_90890 how A/P_True A/P_/93290 AP_wueiwo A/P_|iwoeu you A/P_?9028k ?",
    "Im  fine, what A/P_49 A/P_0.0309 about you?"]
})

df['COL'] = (df['COL'].replace('A[/P|P][^ ]+', '', regex=True)
                      .replace('\s+',' ', regex=True))

Returns (oh, there is an extra space before ?):

                        COL
0              hi how you ?
1  Im fine, what about you?

Upvotes: 2

How to delete substrings with specific characters in a pandas dataframe?

Answers (3)

Related Questions