Reputation: 11
How can I select a subset of a pandas dataframe based on the condition if a column which is a nested list contains a given string.
import pandas as pd
df = pd.DataFrame({'id': [12, 34, 43], 'course': ['Mathematics', 'Sport', 'Biology'], 'students': [['John Doe', 'Peter Parker', 'Lois Lane'], ['Bruce Banner', 'Lois Lane'], ['John Doe', 'Bruce Banner']]})
And now I would like to select all rows in which John Doe is in the students.
Upvotes: 1
Views: 503
Reputation: 14238
You can use str methods (first join list to ',' separated values and then look for 'John Doe'):
df[df['students'].str.join(',').str.match('John Doe')]
But actually the apply
method can be more performant.
The timeit
for a bigger dataframe containing 27 rows(repeated the original df):
%timeit df[df['students'].str.join(',').str.match('John Doe')]
382 µs ± 10.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df[df.students.apply(lambda row: "John Doe" in row)]
271 µs ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Output:
id course students
0 12 Mathematics [John Doe, Peter Parker, Lois Lane]
2 43 Biology [John Doe, Bruce Banner]
Upvotes: 0
Reputation: 3720
Here is a vectorized option:
df[(df['students'].explode() == 'John Doe').groupby(level=0).any()]
Upvotes: 0