Reputation: 359
Suppose I have the following Pandas DataFrame:
     a               b
0  NAN  BABA UN EQUITY
1  NAN            2018
2  NAN            2017
3  NAN            2016
4  NAN             NAN
5  NAN   700 HK EQUITY
6  NAN            2018
7  NAN            2017
8  NAN            2016
9  NAN             NAN
For each cell in column b, I want to check whether it contains the string EQUITY. If it does, I want to fill column a in the rows that follow with that string, until reaching a row whose b value is NAN, to get the edited DataFrame as follows:
                a               b
0             NAN  BABA UN EQUITY
1  BABA UN EQUITY            2018
2  BABA UN EQUITY            2017
3  BABA UN EQUITY            2016
4             NAN             NAN
5             NAN   700 HK EQUITY
6   700 HK EQUITY            2018
7   700 HK EQUITY            2017
8   700 HK EQUITY            2016
9             NAN             NAN
My actual DataFrame is much larger than the above, but the format is similar. I am having trouble figuring out how to check whether a cell contains EQUITY. It seems that str.contains is what I should be using, but it's not clear to me how to apply it here.
Upvotes: 23
Views: 86916
Reputation: 23051
str.contains
Even though they're vectorized, Pandas string methods are not optimized, so falling back to Python loops is often much faster, even for large dataframes.
For example, the following two are equivalent (yet the list comprehension is approx. 3 times faster no matter the length of the column):
msk1 = df['b'].str.contains(r'EQUITY', na=True)
msk = [s != s or 'EQUITY' in s for s in df['b'].tolist()]
all(msk == msk1) # True
That said, str.contains is (arguably) much more readable; besides, both versions are very fast and unlikely to be a performance bottleneck in your code.
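The equivalence of the two masks can be sketched on a toy column (the speed claim itself would be checked with timeit; the data here is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'b': ['BABA UN EQUITY', '2018', np.nan, '700 HK EQUITY']})

# Vectorized pandas string method; na=True maps missing values to True
msk1 = df['b'].str.contains('EQUITY', na=True)

# Plain Python loop; `s != s` is True only for NaN (since NaN != NaN)
msk = [s != s or 'EQUITY' in s for s in df['b'].tolist()]

print(all(msk == msk1))  # True
```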
"|".join
To check whether any of a list of strings exists in the rows of a column, join them with a | separator and call str.contains:
lst = ['EQUITY', '16', '19', '20']
msk = df['b'].str.contains(r'|'.join(lst), na=True)
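One caveat worth noting (an assumption beyond the snippet above, since the example list contains no special characters): str.contains interprets the joined string as a regular expression, so substrings containing metacharacters such as + or ( should be escaped with re.escape first. A minimal sketch:

```python
import re
import pandas as pd

df = pd.DataFrame({'b': ['BABA UN EQUITY', '2018', 'A+B CORP', None]})

lst = ['EQUITY', 'A+B']
# re.escape makes metacharacters such as '+' match literally
pattern = '|'.join(map(re.escape, lst))
msk = df['b'].str.contains(pattern, na=True)
print(msk.tolist())  # [True, False, True, True]
```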
where() and mask()
The final desired output may be reached by masking certain values in column b using the boolean mask msk. The first step is to remove the non-matched values (the numbers in the OP's example) using where(); then propagate the remaining values downward using ffill(); finally, remove the values where the condition is True using mask().
msk = df['b'].str.contains(r'EQUITY', na=True)
df['a'] = df['b'].where(msk).ffill().mask(msk)
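To make the pipeline concrete, here is a sketch on the OP's sample column showing what each step produces (intermediate names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'b': ['BABA UN EQUITY', '2018', '2017', '2016', np.nan,
                         '700 HK EQUITY', '2018', '2017', '2016', np.nan]})

msk = df['b'].str.contains(r'EQUITY', na=True)

kept = df['b'].where(msk)    # EQUITY rows survive; the year rows become NaN
filled = kept.ffill()        # each EQUITY string propagates down over the years
df['a'] = filled.mask(msk)   # blank out the EQUITY/NaN rows themselves

print(df['a'].tolist())
```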
Upvotes: 1
Reputation: 879251
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': ['NAN'] * 10,
                   'b': ['BABA UN EQUITY', '2018', '2017', '2016', 'NAN',
                         '700 HK EQUITY', '2018', '2017', '2016', 'NAN']})
# Make sure that all NaN values are `np.nan` not `'NAN'` (strings)
df = df.replace('NAN', np.nan)
mask = df['b'].str.contains(r'EQUITY', na=True)
df.loc[mask, 'a'] = df['b']
df['a'] = df['a'].ffill()
df.loc[mask, 'a'] = np.nan
yields
                a               b
0             NaN  BABA UN EQUITY
1  BABA UN EQUITY            2018
2  BABA UN EQUITY            2017
3  BABA UN EQUITY            2016
4             NaN             NaN
5             NaN   700 HK EQUITY
6   700 HK EQUITY            2018
7   700 HK EQUITY            2017
8   700 HK EQUITY            2016
9             NaN             NaN
One slightly tricky bit above is how mask is defined. Notice that str.contains returns a Series which contains not only True and False values, but also NaN:
In [114]: df['b'].str.contains(r'EQUITY')
Out[114]:
0     True
1    False
2    False
3    False
4      NaN
5     True
6    False
7    False
8    False
9      NaN
Name: b, dtype: object
str.contains(..., na=True) is used so that the NaNs are treated as True:
In [116]: df['b'].str.contains(r'EQUITY', na=True)
Out[116]:
0     True
1    False
2    False
3    False
4     True
5     True
6    False
7    False
8    False
9     True
Name: b, dtype: bool
Once you have mask, the idea is simple: copy the values from b into a wherever mask is True:
df.loc[mask, 'a'] = df['b']
Forward-fill the NaN values in a:
df['a'] = df['a'].ffill()
Replace the values in a with NaN wherever mask is True:
df.loc[mask, 'a'] = np.nan
Upvotes: 35