omg_me

Reputation: 79

Pandas: Check if Series of strings is in Series with list of strings

I'm looking for a way to check, row by row, whether the string in one pandas Series is contained in the list of strings held in the same row of another Series.

Preferably a one-liner - I'm aware that I can solve this by looping over the rows and building up a new series.

Example:

import pandas as pd
df = pd.DataFrame([
    {'value': 'foo', 'accepted_values': ['foo', 'bar']},
    {'value': 'bar', 'accepted_values': ['foo']},   
])

Desired output would be

pd.Series([True, False])

because 'foo' is in ['foo', 'bar'], but 'bar' is not in ['foo']

What I've tried:

Thanks!

Upvotes: 1

Views: 141

Answers (2)

the_RR

Reputation: 392

There are more efficient ways to carry out this operation than a row-wise apply.

Creating sample data for performance testing


    import random
    import numpy as np
    import pandas as pd

    list_strings = ['apple', 'banana', 'orange', 'pineapple', 'strawberry', 'elephant', 'butterfly', 'rainbow', 'computer']

    rows_shape = 2_000_000
    df = pd.DataFrame({'value': [random.choice(list_strings) for _ in range(rows_shape)],
                       'accepted_values': [random.sample(list_strings, 3) for _ in range(rows_shape)]})

index   value       accepted_values
0       computer    [orange, computer, pineapple]
1       butterfly   [banana, apple, orange]
2       pineapple   [elephant, computer, butterfly]
3       pineapple   [orange, elephant, apple]
4       butterfly   [elephant, butterfly, strawberry]
...     ...         ...
999995  rainbow     [computer, strawberry, rainbow]
999996  pineapple   [banana, apple, computer]
999997  orange      [butterfly, banana, rainbow]
999998  rainbow     [strawberry, banana, butterfly]
999999  strawberry  [strawberry, rainbow, elephant]

    # 4.6 seconds - Using .apply
    df.assign(check=lambda x: x.apply(lambda row: row['value'] in row['accepted_values'], axis=1))

    # 0.431 seconds - Using Pandas Vectorized Operations
    df.assign(check=lambda x: (
        x[['value', 'accepted_values']]
        .explode('accepted_values')
        .pipe(lambda e: e['value'].eq(e['accepted_values']))
        .groupby(level=0)
        .any()))

    # 0.124 seconds - Using Numpy Vectorized Operations
    df.assign(check=lambda x: np.vectorize(
        lambda value, accepted_values: value in accepted_values)(
            x['value'].values, x['accepted_values'].values))

index   value       accepted_values                    check
0       computer    [orange, computer, pineapple]      True
1       butterfly   [banana, apple, orange]            False
2       pineapple   [elephant, computer, butterfly]    False
3       pineapple   [orange, elephant, apple]          False
4       butterfly   [elephant, butterfly, strawberry]  True
...     ...         ...                                ...
999995  rainbow     [computer, strawberry, rainbow]    True
999996  pineapple   [banana, apple, computer]          False
999997  orange      [butterfly, banana, rainbow]       False
999998  rainbow     [strawberry, banana, butterfly]    False
999999  strawberry  [strawberry, rainbow, elephant]    True

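On these timings, the vectorized pandas version is about 11x faster than apply (4.6 s vs 0.431 s), and the NumPy version is about 37x faster (4.6 s vs 0.124 s).

The strategy is to replace row-wise calls with vectorized operations. apply with axis=1 runs a Python-level loop over the rows, which makes it slow, so avoid it wherever a vectorized alternative exists. Note that np.vectorize is itself essentially a loop (the NumPy documentation describes it as provided primarily for convenience, not for performance), so its gain here most likely comes from skipping the per-row Series that apply builds rather than from true vectorization.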
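To make the point about row-wise loops concrete, here is a minimal sketch on a tiny frame (the three-row data and the variable names are made up for illustration, not taken from the benchmark). It compares apply, a plain list comprehension over zip, and the explode-based approach, and shows that all three give the same result. No timings are claimed for it.

```python
import pandas as pd

# Hypothetical three-row sample, only for illustration.
df = pd.DataFrame({
    "value": ["foo", "bar", "baz"],
    "accepted_values": [["foo", "bar"], ["foo"], ["baz", "qux"]],
})

# Row-wise reference: apply with axis=1 builds a Series for every row.
via_apply = df.apply(lambda row: row["value"] in row["accepted_values"], axis=1)

# Same membership test as a plain Python loop over the two columns.
via_zip = pd.Series(
    [value in accepted for value, accepted in zip(df["value"], df["accepted_values"])],
    index=df.index,
)

# Explode-based variant: one row per accepted value, compare, then
# collapse back to the original index with any().
exploded = df.explode("accepted_values")
via_explode = exploded["value"].eq(exploded["accepted_values"]).groupby(level=0).any()

print(via_zip.tolist())  # [True, False, True]
```

The zip version is still a Python loop, but it fits on one line and never constructs a Series per row, so it may be worth timing against the variants above on your own data.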

Upvotes: 0

sophocles

Reputation: 13821

You can use apply with in:

df.apply(lambda r: r.value in r.accepted_values, axis=1)

0     True
1    False
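For completeness, a self-contained sketch of the same idea on the example frame from the question. It uses bracket access (`r["value"]`) instead of attribute access, on the assumption that this is safer when a column name could clash with a Series attribute such as `values` or `index`. The names `is_accepted` and `df_accepted` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame([
    {"value": "foo", "accepted_values": ["foo", "bar"]},
    {"value": "bar", "accepted_values": ["foo"]},
])

# Boolean Series: True where the row's value is in the row's list.
is_accepted = df.apply(lambda r: r["value"] in r["accepted_values"], axis=1)

# The mask can be used directly to keep only the matching rows.
df_accepted = df[is_accepted]

print(is_accepted.tolist())  # [True, False]
```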

Upvotes: 2
