Rafiq
Rafiq

Reputation: 1642

Pandas Dataframe - How to check if the string value in column A is available in the list of string items in column B

Here is my dataframe which has two columns: Column A contains string and column B contains list of strings.

import pandas as pd

df = pd.DataFrame(columns=['A','B'])
df.loc[0] = ['apple',['orange','banana','blueberry']]
df.loc[1] = ['orange',['orange','banana','avocado']]
df.loc[2] = ['blueberry',['apple','banana','blueberry']]
df.loc[3] = ['cherry',['apple','orange','banana']]

print(df)

           A                            B
0      apple  [orange, banana, blueberry]
1     orange    [orange, banana, avocado]
2  blueberry   [apple, banana, blueberry]
3     cherry      [apple, orange, banana]

I want to check for each row to see if the value in column A is listed in the list in column B of the same row. So, the expected output should be:

0 False
1 True
2 True
3 False

I tried isin which works to check against a static list:

df.A.isin(['orange','banana','blueberry'])
0    False
1     True
2    False
3    False

However, when I try to use it to check the list items in the dataframe, it does not work:

df.A.isin(df.B)
TypeError: unhashable type: 'list'

I would like to avoid for loop and lambda if there is a solution available using Pandas.

Any help is greatly appreciated.

Upvotes: 2

Views: 236

Answers (2)

jezrael
jezrael

Reputation: 862691

The fastest is pure list comprehension with check by in:

m = pd.Series([i in j for i, j in zip(df.A, df.B)], index=x.index)
print (m)
0    False
1     True
2     True
3    False
dtype: bool

Solution with apply:

m = df.apply(lambda x: x.A in x.B, axis=1)
print (m)
0    False
1     True
2     True
3    False
dtype: bool

Thank you, @pir for use graph timings solution:

from numpy.core.defchararray import equal

def jez1(x):
    return pd.Series([i in j for i, j in zip(x.A, x.B)], index=x.index)

def jez2(x):
    return x.apply(lambda x: x.A in x.B, axis=1)

def pir1(x):
    return x.A.apply(lambda x: set([x])) <= x.B.apply(set)
def pir2(x):
    return pd.DataFrame(x.B.tolist(), x.index).eq(x.A, 0).any(1)
def pir3(x):
    return x.B.str.join('|').str.get_dummies().mul(pd.get_dummies(x.A)).any(1)

def pir4(x):
    return pd.Series(equal(x.A.values.astype(str), np.array(x.B.tolist()).T).any(0),x.index)

def pir5(x):   
    i = np.arange(len(x)).repeat(x.B.str.len())
    return pd.Series(np.bincount(i, x.A.values[i] == np.concatenate(x.B)).astype(bool),x.index)

res = pd.DataFrame(
    index=[10, 100, 500, 1000],
    columns='jez1 jez2 pir1 pir2 pir3 pir4 pir5'.split(),
    dtype=float
)

for i in res.index:
    d = pd.concat([df] * i, ignore_index=True)
    for j in res.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        res.at[i, j] = timeit(stmt, setp, number=100)

print (res.div(res.min(1), 0))
      jez1        jez2      pir1      pir2       pir3      pir4        pir5
10     1.0   13.235732  4.984622  5.687160  38.796462  1.132400    7.283616
100    1.0   79.879019  6.515313  5.159239  82.787444  1.963980   65.205917
500    1.0  162.672370  6.255446  2.761716  51.753635  3.506066   88.300689
1000   1.0  196.374333  8.813674  2.908213  63.753664  4.797193  125.889481

res.plot(loglog=True)

graph

Upvotes: 3

piRSquared
piRSquared

Reputation: 294288

Fun with sets

df.A.apply(lambda x: set([x])) <= df.B.apply(set)

0    False
1     True
2     True
3    False
dtype: bool

No loops

But I'd still use the @jezrael's comprehension

pd.DataFrame(df.B.tolist(), df.index).eq(df.A, 0).any(1)

0    False
1     True
2     True
3    False
dtype: bool

Numpy broadcasting

Only works if each list in B is of the same length.

from numpy.core.defchararray import equal

pd.Series(
    equal(df.A.values.astype(str), np.array(df.B.tolist()).T).any(0),
    df.index
)

0    False
1     True
2     True
3    False
dtype: bool

pd.get_dummies

df.B.str.join('|').str.get_dummies().mul(pd.get_dummies(df.A)).any(1)

0    False
1     True
2     True
3    False
dtype: bool

np.bincount

I like this one (-:
However, jezrael notes poor performance )-: so beware.

i = np.arange(len(df)).repeat(df.B.str.len())
pd.Series(
    np.bincount(i, df.A.values[i] == np.concatenate(df.B)).astype(bool),
    df.index
)

0    False
1     True
2     True
3    False
dtype: bool

Upvotes: 4

Related Questions