Reputation: 59
Suppose I have the following dataframe:
import pandas as pd
df = pd.DataFrame({'col1': ['one', 'one', 'one', 'one', 'two'],
                   'col2': ['two', 'two', 'four', 'four', 'two'],
                   'col3': [['alpha', 'beta'],
                            ['alpha', 'beta'],
                            ['alpha', 'beta'],
                            ['alpha', 'beta'],
                            ['alpha', 'nodata', 'beta', 'gamma']]})
I know I can subset with:
df[df['col2']=='four']
How do I subset so that it matches a string inside a list? In this example, how do I select the rows that don't contain 'nodata' in col3?
df[~df['col3'].str.contains('nodata')]
doesn't seem to work, and I can't seem to access the right item inside the list.
Upvotes: 2
Views: 768
Reputation: 13705
Rather than converting data types, you can use apply with a lambda function, which will be a bit faster.
df[~df.col3.apply(lambda x: 'nodata' in x)]
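For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the dataframe from the question and applies the same filter. The names has_nodata and result are only illustrative, not part of the original code.

```python
import pandas as pd

# Dataframe from the question: col3 holds Python lists, not strings
df = pd.DataFrame({'col1': ['one', 'one', 'one', 'one', 'two'],
                   'col2': ['two', 'two', 'four', 'four', 'two'],
                   'col3': [['alpha', 'beta'],
                            ['alpha', 'beta'],
                            ['alpha', 'beta'],
                            ['alpha', 'beta'],
                            ['alpha', 'nodata', 'beta', 'gamma']]})

# Boolean mask: True where the list contains the exact element 'nodata'
has_nodata = df['col3'].apply(lambda items: 'nodata' in items)

# Keep only the rows whose list does not contain 'nodata'
result = df[~has_nodata]
print(result)
```

Because the lambda uses Python's `in` on the list itself, it tests for an exact list element rather than a substring.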
Testing it on a larger dataset:
In [86]: df.shape
Out[86]: (5000, 3)
My solution:
In [88]: %timeit df[~df.col3.apply(lambda x: 'nodata' in x)]
1000 loops, best of 3: 1.68 ms per loop
Previous solution:
In [87]: %timeit df[~df['col3'].astype(str).str.contains('nodata')]
100 loops, best of 3: 7.8 ms per loop
Arguably the astype(str) version may be more readable, though.
Upvotes: 3
Reputation: 36545
Your code should work if you convert the column's datatype to string:
df[~df['col3'].astype(str).str.contains('nodata')]
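One thing to be aware of, as far as I can tell: after astype(str) each list becomes its printed form (something like "['alpha', 'beta']"), so str.contains does a substring search over that text rather than checking list elements. The sketch below uses a made-up series, with 'nodata_old' as a hypothetical value, to show where the two checks can disagree.

```python
import pandas as pd

# Hypothetical data: the last list has 'nodata_old' but not 'nodata' itself
s = pd.Series([['alpha', 'beta'],
               ['alpha', 'nodata'],
               ['alpha', 'nodata_old']])

# Substring search on the stringified lists
substring_match = s.astype(str).str.contains('nodata')

# Exact element membership on the original lists
exact_match = s.apply(lambda items: 'nodata' in items)

print(substring_match.tolist())
print(exact_match.tolist())
```

For the data in the question both approaches select the same rows; the difference only matters if some list element merely contains 'nodata' as part of a longer string.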
Upvotes: 1