Reputation: 79

Pandas: Check if Series of strings is in Series with list of strings

I'm looking for a way to check, row by row, whether each string in one pandas Series is contained in the list of strings held in the same row of another Series.

Preferably a one-liner - I'm aware that I can solve this by looping over the rows and building up a new series.

Example:

import pandas as pd
df = pd.DataFrame([
    {'value': 'foo', 'accepted_values': ['foo', 'bar']},
    {'value': 'bar', 'accepted_values': ['foo']},   
])

Desired output would be

pd.Series([True, False])

because 'foo' is in ['foo', 'bar'], but 'bar' is not in ['foo']

What I've tried:

df['value'].isin(df['accepted_values']), but that gives me [False, False]

Thanks!

Upvotes: 1

Answers (2)

the_RR

Reputation: 392

There are more efficient ways to carry out this operation than a row-wise apply.

Creating sample data for performance testing


    import random

    import numpy as np
    import pandas as pd

    list_strings = ['apple', 'banana', 'orange', 'pineapple', 'strawberry', 'elephant', 'butterfly', 'rainbow', 'computer']
    
    rows_shape = 1_000_000
    df = pd.DataFrame({'value': [random.choice(list_strings) for _ in range(rows_shape)],
                       'accepted_values': [random.sample(list_strings, 3) for _ in range(rows_shape)]})

index	value	accepted_values
0	computer	`[orange, computer, pineapple]`
1	butterfly	`[banana, apple, orange]`
2	pineapple	`[elephant, computer, butterfly]`
3	pineapple	`[orange, elephant, apple]`
4	butterfly	`[elephant, butterfly, strawberry]`
...	...	...
999995	rainbow	`[computer, strawberry, rainbow]`
999996	pineapple	`[banana, apple, computer]`
999997	orange	`[butterfly, banana, rainbow]`
999998	rainbow	`[strawberry, banana, butterfly]`
999999	strawberry	`[strawberry, rainbow, elephant]`

    # 4.6 seconds - Using .apply
    df.assign(check=lambda x: x.apply(lambda row: row['value'] in row['accepted_values'], axis=1))
    
    # 0.431 seconds - Using Pandas Vectorized Operations
    df.assign(check=lambda x: (
        x[['value', 'accepted_values']]
        .explode('accepted_values')
        .pipe(lambda e: e['value'].eq(e['accepted_values']))
        .groupby(level=0)
        .any()))

    # 0.124 seconds - Using NumPy's np.vectorize
    df.assign(check=lambda x: np.vectorize(
        lambda value, accepted_values: value in accepted_values)(
            x['value'].values, x['accepted_values'].values))

index	value	accepted_values	check
0	computer	`[orange, computer, pineapple]`	True
1	butterfly	`[banana, apple, orange]`	False
2	pineapple	`[elephant, computer, butterfly]`	False
3	pineapple	`[orange, elephant, apple]`	False
4	butterfly	`[elephant, butterfly, strawberry]`	True
...	...	...	...
999995	rainbow	`[computer, strawberry, rainbow]`	True
999996	pineapple	`[banana, apple, computer]`	False
999997	orange	`[butterfly, banana, rainbow]`	False
999998	rainbow	`[strawberry, banana, butterfly]`	False
999999	strawberry	`[strawberry, rainbow, elephant]`	True

On this data, the explode-based pandas version is about 11x faster than apply, and the np.vectorize version is about 37x faster.

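The explode-based version is faster because the comparison and the grouped any() run as vectorized pandas operations. np.vectorize is still essentially a Python-level loop (the NumPy documentation describes it as a convenience, not a performance tool), but it skips the cost of building a Series object for every row, which is what makes apply with axis=1 slow. On large frames, row-wise apply is best avoided whenever an alternative exists.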
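Before trusting any timing comparison, it is worth checking that the three approaches return the same answer. The following is a rough sketch of such a check, not the exact benchmark used above: it assumes a much smaller frame (10,000 rows) so it runs quickly, and the helper names `with_apply`, `with_explode` and `with_np_vectorize` are illustrative only. The timings it prints will vary by machine and pandas version.

```python
import random
import timeit

import numpy as np
import pandas as pd

random.seed(0)  # fixed seed so repeated runs build the same frame
words = ['apple', 'banana', 'orange', 'pineapple', 'strawberry']
n_rows = 10_000  # assumption: far smaller than the benchmark above

df = pd.DataFrame({
    'value': [random.choice(words) for _ in range(n_rows)],
    'accepted_values': [random.sample(words, 3) for _ in range(n_rows)],
})


def with_apply(frame):
    # Row-wise apply: builds a Series per row, then tests membership.
    return frame.apply(lambda row: row['value'] in row['accepted_values'], axis=1)


def with_explode(frame):
    # One row per (value, accepted value) pair, compared and reduced per original row.
    exploded = frame[['value', 'accepted_values']].explode('accepted_values')
    return exploded['value'].eq(exploded['accepted_values']).groupby(level=0).any()


def with_np_vectorize(frame):
    # np.vectorize loops in Python but avoids per-row Series construction.
    contains = np.vectorize(lambda value, accepted: value in accepted, otypes=[bool])
    flags = contains(frame['value'].values, frame['accepted_values'].values)
    return pd.Series(flags, index=frame.index)


methods = (with_apply, with_explode, with_np_vectorize)
results = {func.__name__: func(df) for func in methods}

for func in methods:
    seconds = timeit.timeit(lambda: func(df), number=3)
    print(f'{func.__name__}: {seconds / 3:.4f} s per run')
```

Note that the explode-based version assumes every list is non-empty; an empty list explodes to a single NaN row, which still compares as False here but is worth keeping in mind for other comparisons.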

Upvotes: 0

sophocles

Reputation: 13821

You can use apply with in:

df.apply(lambda r: r.value in r.accepted_values, axis=1)

0     True
1    False
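If apply turns out to be too slow on a larger frame, a plain list comprehension over zip is a common alternative that does the same membership test without building a row object each time. A minimal sketch on the question's example data (the name `result` is just for illustration):

```python
import pandas as pd

df = pd.DataFrame([
    {'value': 'foo', 'accepted_values': ['foo', 'bar']},
    {'value': 'bar', 'accepted_values': ['foo']},
])

# Pair each value with the list from its own row and test membership in plain Python.
result = pd.Series(
    [value in accepted for value, accepted in zip(df['value'], df['accepted_values'])],
    index=df.index,
)
print(result.tolist())  # [True, False]
```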

Upvotes: 2
