Reputation: 2923
How can I use .isin in pandas so that it compares against values taken from each row of the DataFrame, rather than one static set of values?
For example, let's say we have a DataFrame like this:
import pandas as pd
import datetime
l = []
for i in range(100000):
    d = {'a':i,'b':{1,2,3},'c':0}
    l.append(d)
df = pd.DataFrame(l)
If I use .isin, it takes only a single collection of values (in this example {1,2,3}), and every value in the column being tested (i.e. df['a']) is compared against that same collection:
test = df['a'].isin({1,2,3})
If I want to check, row by row, whether the value in 'a' is in the corresponding set in df['b'], I can do the following:
def check(a, b):
    return a in b
test = list(map(check, df['a'], df['b']))
Of course, in this example all values in df['b'] are the same, but pretend they are not.
Unfortunately, this is about 5x slower than just using .isin. Is there a way to use .isin with the per-row values in df['b']? It doesn't necessarily have to be .isin; what would be a more efficient way to do this?
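To make the intended row-wise behaviour concrete, here is a minimal sketch on a hypothetical three-row frame where each row of 'b' holds a different set. The name small and its values are made up for illustration and are not part of the real data.

```python
import pandas as pd

# Hypothetical frame: each row of 'b' holds a different set.
small = pd.DataFrame({'a': [1, 2, 3], 'b': [{1, 5}, {7, 8}, {3}]})

# .isin tests every value of 'a' against one fixed collection.
static = small['a'].isin({1, 2, 3})

# The row-wise check tests each 'a' against the set in the same row.
row_wise = list(map(lambda a, b: a in b, small['a'], small['b']))

print(static.tolist())  # [True, True, True]
print(row_wise)         # [True, False, True]
```

The second row shows the difference: 2 is in the static collection but not in that row's own set {7, 8}.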
Upvotes: 1
Views: 2034
Reputation: 42946
You can use DataFrame.apply with in here:
df.apply(lambda x: x['a'] in x['b'], axis=1)
0 False
1 True
2 True
3 True
4 False
...
99995 False
99996 False
99997 False
99998 False
99999 False
Length: 100000, dtype: bool
Or use a list comprehension with zip, which is faster:
[a in b for a, b in zip(df['a'], df['b'])]
[False,
True,
True,
True,
False,
False,
False,
False,
False,
False,
False,
False,
False,
...]
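Since the list comprehension returns a plain Python list rather than a Series, you may want to wrap it so it lines up with the DataFrame's index. A minimal sketch, assuming a small four-row stand-in for df and a hypothetical column name a_in_b:

```python
import pandas as pd

# Small stand-in for the DataFrame in the question.
df = pd.DataFrame({'a': [0, 1, 2, 4], 'b': [{1, 2, 3}] * 4})

# Row-wise membership test via zip, wrapped in a Series that reuses df's index.
mask = pd.Series([a in b for a, b in zip(df['a'], df['b'])], index=df.index)

df['a_in_b'] = mask   # store the result as a new boolean column (hypothetical name)
matches = df[mask]    # or use it directly to filter rows

print(matches['a'].tolist())  # [1, 2]
```

Passing index=df.index matters when the DataFrame does not have the default 0..n-1 index, because otherwise the mask would not align when used for filtering.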
%%timeit
def check(a, b):
    return a in b
list(map(check, df['a'], df['b']))
28.6 ms ± 1.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
[a in b for a, b in zip(df['a'], df['b'])]
22.5 ms ± 851 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
df.apply(lambda x: x['a'] in x['b'], axis=1)
2.27 s ± 29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
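The %%timeit cells above only work in IPython or Jupyter. If you want to reproduce the comparison in a plain script, something along these lines should do, using the standard timeit module. The function names and the reduced row count are my own choices for this sketch, and the absolute numbers will differ by machine.

```python
import timeit
import pandas as pd

n = 10_000  # fewer rows than the 100000 above so apply finishes quickly
df = pd.DataFrame({'a': range(n), 'b': [{1, 2, 3} for _ in range(n)]})

def with_map():
    return list(map(lambda a, b: a in b, df['a'], df['b']))

def with_zip():
    return [a in b for a, b in zip(df['a'], df['b'])]

def with_apply():
    return df.apply(lambda x: x['a'] in x['b'], axis=1).tolist()

# All three approaches should agree before their timings are compared.
assert with_map() == with_zip() == with_apply()

for fn in (with_map, with_zip, with_apply):
    seconds = timeit.timeit(fn, number=3) / 3
    print(f"{fn.__name__}: {seconds * 1000:.1f} ms per run")
```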
Upvotes: 3