Reputation: 85
I am new to Pandas and not sure how to do the following:
I have a dataframe (df) with several columns. One column is called
OldCat = ['a-nn', 'bb-nm', 'ab-pp', 'ba-nn', 'cc-nm', 'ca-mn']
Now I want to create a new column that organizes/categories OldCat in a new way (NewCat). Specifically, I would like to look for strings in OldCat that contain either a(at the beginning), ba or ca j='^a|ba|ca' and then replace nan in NewCat with 'A' in the rows where OldCat contains j. In R this would be something like this:
j='^a|ba|ca'
df[(OldCat %like% j),NewCat := str_replace_all(df[(OldCat %like% j),NewCat], "nan", "A")]
Since R is relatively slow with my very large dataset, I decided to switch to pandas but I am not sure how to get the same result. Cheers
Upvotes: 1
Views: 393
Reputation: 394099
You can use the vectorised str.extract
to return matches with fillna
to replace NaN
with the string 'nan'
:
In [119]:
df['NewCat'] = df['OldCat'].str.extract('(^a|ba|ca)', expand=False).fillna('nan')
df
Out[119]:
OldCat NewCat
0 a a
1 bb nan
2 ab a
3 ba ba
4 cc nan
5 ca ca
EDIT
IIUC then you can use str.contains
with loc
to set just the rows that contain the strings:
In [137]:
df.loc[df['OldCat'].str.contains(r'a|ba|ca'), 'NewCat'] = 'A'
df['NewCat'].fillna('nan', inplace=True)
df
Out[137]:
OldCat NewCat
0 a-nn A
1 bb-nm nan
2 ab-pp A
3 ba-nn A
4 cc-nm nan
5 ca-mn A
Upvotes: 2