Subsetting pandas dataframe with list in cell

Question

Suppose I have the following dataframe

import pandas as pd

df = pd.DataFrame({'col1': ['one','one', 'one', 'one', 'two'],
                   'col2': ['two','two','four','four','two'],
                   'col3': [['alpha', 'beta'],
                            ['alpha', 'beta'],
                            ['alpha', 'beta'],
                            ['alpha', 'beta'],
                            ['alpha', 'nodata', 'beta', 'gamma']]})

I know I can subset with:

df[df['col2']=='four']

How do I subset so that it matches a string inside a list? In this example, I want the rows whose col3 list does not contain 'nodata'.

df[~df['col3'].str.contains('nodata')]

doesn't seem to work, and I can't seem to access the right item inside the list.

johnchase · Accepted Answer

Rather than converting data types, you can use apply with a lambda function that tests list membership directly. In the timings below it is also a bit faster.

df[~df.col3.apply(lambda x: 'nodata' in x)]
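For reference, here is the same one-liner as a self-contained sketch, using the dataframe from the question. The names has_nodata and subset are only illustrative and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['one', 'one', 'one', 'one', 'two'],
    'col2': ['two', 'two', 'four', 'four', 'two'],
    'col3': [['alpha', 'beta'],
             ['alpha', 'beta'],
             ['alpha', 'beta'],
             ['alpha', 'beta'],
             ['alpha', 'nodata', 'beta', 'gamma']],
})

# Boolean Series: True where the list in col3 has 'nodata' as an element
has_nodata = df['col3'].apply(lambda items: 'nodata' in items)

# Invert the mask to keep only the rows without 'nodata'
subset = df[~has_nodata]
print(subset)
```

Building the mask as a separate step makes it easy to inspect before it is used to filter.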

Testing it on a larger dataset:

In [86]: df.shape
Out[86]: (5000, 3)

My solution:

In [88]: %timeit df[~df.col3.apply(lambda x: 'nodata' in x)]
         1000 loops, best of 3: 1.68 ms per loop

Previous solution:

In [87]: %timeit df[~df['col3'].astype(str).str.contains('nodata')]
         100 loops, best of 3: 7.8 ms per loop

Arguably the previous solution, using astype(str), may be more readable, though.
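One thing to keep in mind when choosing between the two is that they are not strictly equivalent. The apply version tests whether 'nodata' is an element of the list, while the astype(str) version searches the string representation of the whole list for a substring. A small sketch of the difference, where 'nodata_v2' is a made-up value used only for illustration:

```python
import pandas as pd

# 'nodata_v2' is a hypothetical value, chosen to contain 'nodata' as a substring
s = pd.Series([['alpha', 'nodata'], ['alpha', 'nodata_v2'], ['beta']])

# Exact element membership in each list
by_membership = s.apply(lambda items: 'nodata' in items)

# Substring search over the stringified list, e.g. "['alpha', 'nodata_v2']"
by_substring = s.astype(str).str.contains('nodata')

print(by_membership.tolist())  # [True, False, False]
print(by_substring.tolist())   # [True, True, False]
```

If the lists can hold values that merely contain the search term, the membership test is probably the safer choice.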
