Reputation: 9345
I have a Pandas DataFrame like this:
           A  B
0  [C, D, E]  C
1  [X, Y, Z]  G
created from:
example = pd.DataFrame({"A":[["C", "D", "E"], ["X", "Y", "Z"]], "B":["C", "G"]})
I want to count how often a value occurs both in the list in column A and under column B.
So the correct output for value C would be 1 and for value Z would be 0. Any suggestions without resorting to going row-by-row (and losing out on vectorization)?
Thanks!
Upvotes: 0
Views: 162
Reputation: 51155
Not necessarily a vectorized approach, but using apply:
df.apply(lambda x: x['B'] in x['A'], axis=1).astype(int)
0 1
1 0
dtype: int32
Edit: I dropped np.in1d from this answer because of how badly it scaled.
Surprisingly, I got a huge performance boost using a basic list comprehension over apply:
pd.Series([b in a for a, b in zip(df.A, df.B)]).astype(int)
Some timings:
df = pd.concat([df]*5000)
In [158]: %timeit pd.Series([b in a for a, b in zip(df.A, df.B)]).astype(int)
1.55 ms ± 40.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [159]: %timeit df.apply(lambda x: x['B'] in x['A'], axis=1).astype(int)
344 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
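The snippets above return one 0/1 flag per row, while the question asks for a count per value. A minimal sketch of one way to aggregate the flags, assuming values that never match should be reported as 0 (the names mask, counts and all_values are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({"A": [["C", "D", "E"], ["X", "Y", "Z"]], "B": ["C", "G"]})

# Row-wise membership test, same idea as the list comprehension above.
mask = pd.Series([b in a for a, b in zip(df["A"], df["B"])], index=df.index)

# For each value of B, count the rows where it was found in that row's list.
counts = df.loc[mask, "B"].value_counts()

# Assumption: every value seen in any list should appear, with 0 if it never matched.
all_values = sorted({v for lst in df["A"] for v in lst})
counts = counts.reindex(all_values, fill_value=0)

print(counts)
```

With the example frame this gives 1 for C and 0 for Z, matching the output the question asks for.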
Upvotes: 1
Reputation: 3399
Here's an approach that simply explodes the list and counts using groupby:
import pandas as pd
df = pd.DataFrame({"A":[["C", "D", "E"], ["X", "Y", "Z"]], "B":["C", "G"]})
# One row per (list element, B value) pair; the columns are named 0 and 1
df1 = pd.DataFrame([(a, b) for lst, b in zip(df['A'], df['B']) for a in lst])
df1['same'] = (df1[0] == df1[1]).astype(int)
df1.groupby(0).same.sum()
Output:
0
C 1
D 0
E 0
X 0
Y 0
Z 0
Name: same, dtype: int64
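On pandas 0.25 or newer, the same explode-then-group idea can probably be written with the built-in DataFrame.explode instead of a manual loop. This is only a sketch under that version assumption, not the answer's original code; the names exploded and result are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"A": [["C", "D", "E"], ["X", "Y", "Z"]], "B": ["C", "G"]})

# One row per list element; the B value of the original row is repeated.
exploded = df.explode("A")

# Flag elements equal to their row's B value, then sum the flags per element.
exploded["same"] = (exploded["A"] == exploded["B"]).astype(int)
result = exploded.groupby("A")["same"].sum()

print(result)
```

This should give the same counts as the output above, with the index named A rather than 0.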
Upvotes: 1