Reputation: 1642
Here is my dataframe which has two columns: Column A contains string and column B contains list of strings.
import pandas as pd
df = pd.DataFrame(columns=['A','B'])
df.loc[0] = ['apple',['orange','banana','blueberry']]
df.loc[1] = ['orange',['orange','banana','avocado']]
df.loc[2] = ['blueberry',['apple','banana','blueberry']]
df.loc[3] = ['cherry',['apple','orange','banana']]
print(df)
A B
0 apple [orange, banana, blueberry]
1 orange [orange, banana, avocado]
2 blueberry [apple, banana, blueberry]
3 cherry [apple, orange, banana]
I want to check for each row to see if the value in column A is listed in the list in column B of the same row. So, the expected output should be:
0 False
1 True
2 True
3 False
I tried isin
which works to check against a static list:
df.A.isin(['orange','banana','blueberry'])
0 False
1 True
2 False
3 False
However, when I try to use it to check the list items in the dataframe, it does not work:
df.A.isin(df.B)
TypeError: unhashable type: 'list'
I would like to avoid for loop and lambda if there is a solution available using Pandas.
Any help is greatly appreciated.
Upvotes: 2
Views: 236
Reputation: 862691
The fastest is pure list comprehension with check by in
:
m = pd.Series([i in j for i, j in zip(df.A, df.B)], index=x.index)
print (m)
0 False
1 True
2 True
3 False
dtype: bool
Solution with apply
:
m = df.apply(lambda x: x.A in x.B, axis=1)
print (m)
0 False
1 True
2 True
3 False
dtype: bool
Thank you, @pir for use graph timings solution:
from numpy.core.defchararray import equal
def jez1(x):
return pd.Series([i in j for i, j in zip(x.A, x.B)], index=x.index)
def jez2(x):
return x.apply(lambda x: x.A in x.B, axis=1)
def pir1(x):
return x.A.apply(lambda x: set([x])) <= x.B.apply(set)
def pir2(x):
return pd.DataFrame(x.B.tolist(), x.index).eq(x.A, 0).any(1)
def pir3(x):
return x.B.str.join('|').str.get_dummies().mul(pd.get_dummies(x.A)).any(1)
def pir4(x):
return pd.Series(equal(x.A.values.astype(str), np.array(x.B.tolist()).T).any(0),x.index)
def pir5(x):
i = np.arange(len(x)).repeat(x.B.str.len())
return pd.Series(np.bincount(i, x.A.values[i] == np.concatenate(x.B)).astype(bool),x.index)
res = pd.DataFrame(
index=[10, 100, 500, 1000],
columns='jez1 jez2 pir1 pir2 pir3 pir4 pir5'.split(),
dtype=float
)
for i in res.index:
d = pd.concat([df] * i, ignore_index=True)
for j in res.columns:
stmt = '{}(d)'.format(j)
setp = 'from __main__ import d, {}'.format(j)
res.at[i, j] = timeit(stmt, setp, number=100)
print (res.div(res.min(1), 0))
jez1 jez2 pir1 pir2 pir3 pir4 pir5
10 1.0 13.235732 4.984622 5.687160 38.796462 1.132400 7.283616
100 1.0 79.879019 6.515313 5.159239 82.787444 1.963980 65.205917
500 1.0 162.672370 6.255446 2.761716 51.753635 3.506066 88.300689
1000 1.0 196.374333 8.813674 2.908213 63.753664 4.797193 125.889481
res.plot(loglog=True)
Upvotes: 3
Reputation: 294288
sets
df.A.apply(lambda x: set([x])) <= df.B.apply(set)
0 False
1 True
2 True
3 False
dtype: bool
But I'd still use the @jezrael's comprehension
pd.DataFrame(df.B.tolist(), df.index).eq(df.A, 0).any(1)
0 False
1 True
2 True
3 False
dtype: bool
Only works if each list in B
is of the same length.
from numpy.core.defchararray import equal
pd.Series(
equal(df.A.values.astype(str), np.array(df.B.tolist()).T).any(0),
df.index
)
0 False
1 True
2 True
3 False
dtype: bool
pd.get_dummies
df.B.str.join('|').str.get_dummies().mul(pd.get_dummies(df.A)).any(1)
0 False
1 True
2 True
3 False
dtype: bool
np.bincount
I like this one (-:
However, jezrael notes poor performance )-: so beware.
i = np.arange(len(df)).repeat(df.B.str.len())
pd.Series(
np.bincount(i, df.A.values[i] == np.concatenate(df.B)).astype(bool),
df.index
)
0 False
1 True
2 True
3 False
dtype: bool
Upvotes: 4