Reputation: 359
Suppose I have the following Pandas DataFrame:
     a               b
0  NAN  BABA UN EQUITY
1  NAN            2018
2  NAN            2017
3  NAN            2016
4  NAN             NAN
5  NAN   700 HK EQUITY
6  NAN            2018
7  NAN            2017
8  NAN            2016
9  NAN             NAN
For each cell in column b, I want to check whether it contains the string EQUITY. If it does, I want to fill column a in the rows that follow with that string, until reaching a row whose b value is NAN, to get the edited DataFrame as follows:
                a               b
0             NAN  BABA UN EQUITY
1  BABA UN EQUITY            2018
2  BABA UN EQUITY            2017
3  BABA UN EQUITY            2016
4             NAN             NAN
5             NAN   700 HK EQUITY
6   700 HK EQUITY            2018
7   700 HK EQUITY            2017
8   700 HK EQUITY            2016
9             NAN             NAN
My actual DataFrame is much larger than the above, but the format is similar. I am having trouble figuring out how to check whether a cell contains EQUITY. It seems that str.contains is what I should be using, but it's not clear to me how to apply it here.
Upvotes: 23
Views: 86916
Reputation: 23051
str.contains
Even though they're vectorized, Pandas string methods are not optimized, so falling back to Python loops is often much faster, even for large dataframes.
For example, the following two are equivalent (yet the list comprehension is approx. 3 times faster no matter the length of the column):
msk1 = df['b'].str.contains(r'EQUITY', na=True)
msk = [s != s or 'EQUITY' in s for s in df['b'].tolist()]
all(msk == msk1) # True
That said, str.contains is (arguably) much more readable; besides, both versions are very fast and unlikely to be a performance bottleneck in your code.
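The equivalence of the two masks can be sketched on a toy column (the speed claim itself would be checked with timeit; the data here is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'b': ['BABA UN EQUITY', '2018', np.nan, '700 HK EQUITY']})

# Vectorized pandas string method; na=True maps missing values to True
msk1 = df['b'].str.contains('EQUITY', na=True)

# Plain Python loop; `s != s` is True only for NaN (since NaN != NaN)
msk = [s != s or 'EQUITY' in s for s in df['b'].tolist()]

print(all(msk == msk1))  # True
```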
"|".join
To check whether any of a list of strings exists in the rows of a column, join them with a | separator and call str.contains:
lst = ['EQUITY', '16', '19', '20']
msk = df['b'].str.contains(r'|'.join(lst), na=True)
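One caveat worth noting (an assumption beyond the snippet above, since the example list contains no special characters): str.contains interprets the joined string as a regular expression, so substrings containing metacharacters such as + or ( should be escaped with re.escape first. A minimal sketch:

```python
import re
import pandas as pd

df = pd.DataFrame({'b': ['BABA UN EQUITY', '2018', 'A+B CORP', None]})

lst = ['EQUITY', 'A+B']
# re.escape makes metacharacters such as '+' match literally
pattern = '|'.join(map(re.escape, lst))
msk = df['b'].str.contains(pattern, na=True)
print(msk.tolist())  # [True, False, True, True]
```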
where() and mask()
The final desired output may be reached by masking certain values in column b using the boolean mask msk. The first step is to remove the non-matched values (the numbers in the OP's example) using where(); then propagate the remaining values downward using ffill(); finally, remove the values where the condition is True using mask().
msk = df['b'].str.contains(r'EQUITY', na=True)
df['a'] = df['b'].where(msk).ffill().mask(msk)
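To make the pipeline concrete, here is a sketch on the OP's sample column showing what each step produces (intermediate names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'b': ['BABA UN EQUITY', '2018', '2017', '2016', np.nan,
                         '700 HK EQUITY', '2018', '2017', '2016', np.nan]})

msk = df['b'].str.contains(r'EQUITY', na=True)

kept = df['b'].where(msk)    # EQUITY rows survive; the year rows become NaN
filled = kept.ffill()        # each EQUITY string propagates down over the years
df['a'] = filled.mask(msk)   # blank out the EQUITY/NaN rows themselves

print(df['a'].tolist())
```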
Upvotes: 1
Reputation: 879251
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': ['NAN'] * 10,
                   'b': ['BABA UN EQUITY', '2018', '2017', '2016', 'NAN',
                         '700 HK EQUITY', '2018', '2017', '2016', 'NAN']})
# Make sure that all NaN values are `np.nan` not `'NAN'` (strings)
df = df.replace('NAN', np.nan)
mask = df['b'].str.contains(r'EQUITY', na=True)
df.loc[mask, 'a'] = df['b']
df['a'] = df['a'].ffill()
df.loc[mask, 'a'] = np.nan
yields
                a               b
0             NaN  BABA UN EQUITY
1  BABA UN EQUITY            2018
2  BABA UN EQUITY            2017
3  BABA UN EQUITY            2016
4             NaN             NaN
5             NaN   700 HK EQUITY
6   700 HK EQUITY            2018
7   700 HK EQUITY            2017
8   700 HK EQUITY            2016
9             NaN             NaN
One slightly tricky bit above is how mask is defined. Notice that str.contains returns a Series which contains not only True and False values, but also NaN:
In [114]: df['b'].str.contains(r'EQUITY')
Out[114]:
0     True
1    False
2    False
3    False
4      NaN
5     True
6    False
7    False
8    False
9      NaN
Name: b, dtype: object
str.contains(..., na=True) is used so that the NaNs are treated as True:
In [116]: df['b'].str.contains(r'EQUITY', na=True)
Out[116]:
0     True
1    False
2    False
3    False
4     True
5     True
6    False
7    False
8    False
9     True
Name: b, dtype: bool
Once you have mask, the idea is simple: copy the values from b into a wherever mask is True:
df.loc[mask, 'a'] = df['b']
Forward-fill the NaN values in a:
df['a'] = df['a'].ffill()
Replace the values in a with NaN wherever mask is True:
df.loc[mask, 'a'] = np.nan
Upvotes: 35