Hinshman
Hinshman

Reputation: 11

Select rows of pandas dataframe based on string in nested list

How can I select a subset of a pandas dataframe based on the condition if a column which is a nested list contains a given string.

import pandas as pd

df = pd.DataFrame({'id': [12, 34, 43], 'course': ['Mathematics', 'Sport', 'Biology'], 'students': [['John Doe', 'Peter Parker', 'Lois Lane'], ['Bruce Banner', 'Lois Lane'], ['John Doe', 'Bruce Banner']]})

And now I would like to select all rows in which John Doe is in the students.

Upvotes: 1

Views: 503

Answers (3)

SomeDude
SomeDude

Reputation: 14238

You can use str methods (first join list to ',' separated values and then look for 'John Doe'):

df[df['students'].str.join(',').str.match('John Doe')]

But actually the apply method can be more performant.

The timeit for a bigger dataframe containing 27 rows(repeated the original df):

%timeit df[df['students'].str.join(',').str.match('John Doe')]
382 µs ± 10.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df[df.students.apply(lambda row: "John Doe" in row)]
271 µs ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Output:

   id       course                             students
0  12  Mathematics  [John Doe, Peter Parker, Lois Lane]
2  43      Biology             [John Doe, Bruce Banner]

Upvotes: 0

sitting_duck
sitting_duck

Reputation: 3720

Here is a vectorized option:

df[(df['students'].explode() == 'John Doe').groupby(level=0).any()]

Upvotes: 0

the_pr0blem
the_pr0blem

Reputation: 311

df[df.students.apply(lambda row: "John Doe" in row)]

Upvotes: 1

Related Questions