Reputation: 866
I have a pandas dataframe that looks like this:
COL
hi A/P_90890 how A/P_True A/P_/93290 are AP_wueiwo A/P_|iwoeu you A/P_?9028k ?
...
Im fine, what A/P_49 A/P_0.0309 about you?
The expected result should be:
COL
hi how are you?
...
Im fine, what about you?
How can I remove efficiently from a column and for the full pandas dataframe all the strings that have A/P_
?
I tried with this regular expression:
A/P_(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+
However, I do not know if there's a more simpler or robust way of removing all those substrings from my dataframe. How can I remove all the strings that have A/P_
at the beginning?
UPDATE
I tried:
df_sess['COL'] = df_sess['COL'].str.replace(r'A/P(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '')
And it works, however I would like to know if there's a more robust way of doing this. Possibily with a regular expression.
Upvotes: 3
Views: 2086
Reputation: 29635
one way could be to use \S*
matching all non withespaces after A/P_
and also add \s
to remove the whitespace after the string to remove, such as:
df_sess['COL'] = df_sess['col'].str.replace(r'A/P_\S*\s', '')
In you input, it seems there is an typo error (or at least I think so), so with this input:
df_sess = pd.DataFrame({'col':['hi A/P_90890 how A/P_True A/P_/93290 are A/P_wueiwo A/P_|iwoeu you A/P_?9028k ?',
'Im fine, what A/P_49 A/P_0.0309 about you?']})
print (df_sess['col'].str.replace(r'A/P_\S*\s', ''))
0 hi how are you ?
1 Im fine, what about you?
Name: col, dtype: object
you get the expected output
Upvotes: 3
Reputation: 92854
Because of pandas 0.23.0 bug in replace()
function (https://github.com/pandas-dev/pandas/issues/21159) when trying to replace by regex pattern the error occurs:
df.COL.str.replace(regex_pat, '', regex=True)
...
--->
TypeError: Type aliases cannot be used with isinstance().
I would suggest to use pandas.Series.apply
function with precompiled regex pattern:
In [1170]: df4 = pd.DataFrame({'COL': ['hi A/P_90890 how A/P_True A/P_/93290 are AP_wueiwo A/P_|iwoeu you A/P_?9028k ?', 'Im fine, what A/P_49 A/P_0.0309 about you?']})
In [1171]: pat = re.compile(r'\s*A/?P_[^\s]*')
In [1172]: df4['COL']= df4.COL.apply(lambda x: pat.sub('', x))
In [1173]: df4
Out[1173]:
COL
0 hi how are you ?
1 Im fine, what about you?
Upvotes: 2
Reputation: 18906
How about:
(df['COL'].replace('A[/P|P][^ ]+', '', regex=True)
.replace('\s+',' ', regex=True))
Full example:
import pandas as pd
df = pd.DataFrame({
'COL':
["hi A/P_90890 how A/P_True A/P_/93290 AP_wueiwo A/P_|iwoeu you A/P_?9028k ?",
"Im fine, what A/P_49 A/P_0.0309 about you?"]
})
df['COL'] = (df['COL'].replace('A[/P|P][^ ]+', '', regex=True)
.replace('\s+',' ', regex=True))
Returns (oh, there is an extra space before ?):
COL
0 hi how you ?
1 Im fine, what about you?
Upvotes: 2