Reputation: 79
I'm looking for a way to check, row by row, whether each string in one pandas Series is contained in the list of strings held at the same position in another Series.
Preferably a one-liner - I'm aware that I can solve this by looping over the rows and building up a new series.
Example:
import pandas as pd
df = pd.DataFrame([
{'value': 'foo', 'accepted_values': ['foo', 'bar']},
{'value': 'bar', 'accepted_values': ['foo']},
])
Desired output would be
pd.Series([True, False])
because 'foo' is in ['foo', 'bar'], but 'bar' is not in ['foo'].
What I've tried:
df['value'].isin(df['accepted_values'])
but that gives me [False, False].
Thanks!
Upvotes: 1
Views: 141
Reputation: 392
There are more efficient ways to carry out this operation.
Creating sample data for performance testing
import random
import numpy as np
import pandas as pd
list_strings = ['apple', 'banana', 'orange', 'pineapple', 'strawberry', 'elephant', 'butterfly', 'rainbow', 'computer']
rows_shape = 1_000_000  # matches the 1,000,000-row output shown below
df = pd.DataFrame({'value': [random.choice(list_strings) for _ in range(rows_shape)],
                   'accepted_values': [random.sample(list_strings, 3) for _ in range(rows_shape)]})
index | value | accepted_values |
---|---|---|
0 | computer | [orange, computer, pineapple] |
1 | butterfly | [banana, apple, orange] |
2 | pineapple | [elephant, computer, butterfly] |
3 | pineapple | [orange, elephant, apple] |
4 | butterfly | [elephant, butterfly, strawberry] |
... | ... | ... |
999995 | rainbow | [computer, strawberry, rainbow] |
999996 | pineapple | [banana, apple, computer] |
999997 | orange | [butterfly, banana, rainbow] |
999998 | rainbow | [strawberry, banana, butterfly] |
999999 | strawberry | [strawberry, rainbow, elephant] |
# 4.6 seconds - Using .apply
df.assign(check=lambda x: x.apply(lambda row: row['value'] in row['accepted_values'], axis=1))
# 0.431 seconds - Using Pandas Vectorized Operations
df.assign(check=lambda z:
    z.pipe(lambda x: (
        x[['value', 'accepted_values']]
        .explode('accepted_values')
        .pipe(lambda x: x['value'].eq(x['accepted_values']))
        .groupby(level=0)
        .any()))
)
# 0.124 seconds - Using Numpy Vectorized Operations
df.assign(check=lambda x: x.pipe(lambda data: np.vectorize(
    lambda value, accepted_values: value in accepted_values)
    (data['value'].values,
     data['accepted_values'].values)))
index | value | accepted_values | check |
---|---|---|---|
0 | computer | [orange, computer, pineapple] | True |
1 | butterfly | [banana, apple, orange] | False |
2 | pineapple | [elephant, computer, butterfly] | False |
3 | pineapple | [orange, elephant, apple] | False |
4 | butterfly | [elephant, butterfly, strawberry] | True |
... | ... | ... | ... |
999995 | rainbow | [computer, strawberry, rainbow] | True |
999996 | pineapple | [banana, apple, computer] | False |
999997 | orange | [butterfly, banana, rainbow] | False |
999998 | rainbow | [strawberry, banana, butterfly] | False |
999999 | strawberry | [strawberry, rainbow, elephant] | True |
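On this sample, the explode/groupby version is about 11x faster than apply (4.6 s vs 0.431 s), and the np.vectorize version is about 37x faster (4.6 s vs 0.124 s).
The strategy is to avoid apply with axis=1, which runs a Python-level loop and builds a Series object for every row, so it is slow on large frames and best avoided when an alternative exists. Note that np.vectorize is, per the NumPy documentation, a convenience wrapper that is essentially a for loop rather than a truly vectorized operation; its advantage here comes from skipping the per-row overhead of apply.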
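A plain list comprehension over zip is another way to skip the per-row overhead of apply, and it reads as a one-liner. The sketch below uses the two-row frame from the question; it was not part of the timings above, so no speed figure is claimed for it and it is worth timing on your own data.

```python
import pandas as pd

# Two-row example frame from the question.
df = pd.DataFrame([
    {'value': 'foo', 'accepted_values': ['foo', 'bar']},
    {'value': 'bar', 'accepted_values': ['foo']},
])

# Pair each value with the list in the same row and test membership directly.
check = pd.Series(
    [value in accepted for value, accepted in zip(df['value'], df['accepted_values'])],
    index=df.index,  # keep the original row labels
)
print(check.tolist())  # [True, False]
```

Passing index=df.index keeps the result aligned with the frame, so it can be assigned straight back as a new column.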
Upvotes: 0
Reputation: 13821
You can use apply with in:
df.apply(lambda r: r.value in r.accepted_values, axis=1)
0     True
1    False
dtype: bool
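As for why the isin attempt in the question returns all False: isin asks whether each value equals one of the items in the collection it is given, and here those items are whole lists, not the strings inside them. The pure-Python sketch below illustrates that comparison; it mirrors the semantics only and is not how pandas implements isin internally.

```python
# The items isin would compare against: each one is a whole list.
candidates = [['foo', 'bar'], ['foo']]

# A string is never equal to a list, so no candidate matches 'foo'.
equals_any_list = any('foo' == candidate for candidate in candidates)

# Row-wise membership looks inside the list for the same row instead.
inside_own_list = 'foo' in candidates[0]

print(equals_any_list, inside_own_list)  # False True
```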
Upvotes: 2