anon_swe

Reputation: 9345

Pandas: Check Column Membership in Other Column (Same Row)

I have a Pandas DataFrame like this:

           A  B
0  [C, D, E]  C
1  [X, Y, Z]  G

created from:

import pandas as pd
example = pd.DataFrame({"A": [["C", "D", "E"], ["X", "Y", "Z"]], "B": ["C", "G"]})

For a given value, I want to count the rows in which it appears both in the list in column A and in column B of that same row.

So the correct output for value C would be 1 and for value Z would be 0. Any suggestions without resorting to going row-by-row (and losing out on vectorization)?

Thanks!

Upvotes: 0

Views: 162

Answers (2)

user3483203

Reputation: 51155

This isn't truly vectorized, but you can use apply with axis=1:

df.apply(lambda x: x['B'] in x['A'], axis=1).astype(int)

0    1
1    0
dtype: int32

Edit: I've dropped the np.in1d version from this answer because it scaled so badly.

Surprisingly, I got a huge performance boost using a basic list comprehension over apply:

pd.Series([b in a for a, b in zip(df.A, df.B)]).astype(int)

Some timings:

df = pd.concat([df]*5000)

In [158]: %timeit pd.Series([b in a for a, b in zip(df.A, df.B)]).astype(int)
1.55 ms ± 40.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [159]: %timeit df.apply(lambda x: x['B'] in x['A'], axis=1).astype(int)
344 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
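
The list comprehension above yields one flag per row, while the question asks for a count per value. As a minimal sketch (the names `mask` and `counts` are illustrative, not from the original answer), the flags can be used to keep only the matching values of B and tally them with `value_counts`:

```python
import pandas as pd

df = pd.DataFrame({"A": [["C", "D", "E"], ["X", "Y", "Z"]], "B": ["C", "G"]})

# Row-wise membership flags, built with the same list comprehension as above.
mask = [b in a for a, b in zip(df["A"], df["B"])]

# Keep only the B values found in their own row's list, then tally them.
counts = df.loc[mask, "B"].value_counts()

# Values that never match are absent from the tally, so default to 0.
print(counts.get("C", 0))  # 1
print(counts.get("Z", 0))  # 0
```

Passing a plain list of booleans to `.loc` sidesteps index alignment, which matters once the frame has duplicate index labels, as it does after the `pd.concat` used for the timings.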

Upvotes: 1

Ashish Acharya

Reputation: 3399

Here's an approach that simply explodes the list and counts using groupby:

import pandas as pd

df = pd.DataFrame({"A":[["C", "D", "E"], ["X", "Y", "Z"]], "B":["C", "G"]})

df1 = pd.DataFrame([[j, df.loc[i, 'B']] for i in df.index for j in df.loc[i, 'A']])

df1['same'] = (df1[0] == df1[1]).astype(int)

df1.groupby(0).same.sum()

Output:

0
C    1
D    0
E    0
X    0
Y    0
Z    0
Name: same, dtype: int64
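
On pandas 0.25 or later, the same explode-then-group idea can probably be written with the built-in `DataFrame.explode` instead of a manual loop. This is a sketch under that version assumption, and the names `exploded` and `counts` are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"A": [["C", "D", "E"], ["X", "Y", "Z"]], "B": ["C", "G"]})

# One row per list element; column B is repeated alongside each element.
exploded = df.explode("A")

# Flag the elements equal to their row's B value, then sum the flags per element.
counts = (exploded["A"] == exploded["B"]).groupby(exploded["A"]).sum()

print(counts["C"])  # 1
print(counts["Z"])  # 0
```

Note that an element repeated within a single list is counted once per occurrence, and rows whose list is empty drop out of the grouped result.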

Upvotes: 1
