dofine

Reputation: 883

Pandas way of getting the intersection between two rows in a Python pandas DataFrame

Say I have some data that looks like the table below. For each pair of tags, I want to count the ids that carry both tags.

tag id
a A
b B
a B
b A
c A

The result I want:

tag1 tag2 count
a b 2
a c 1
b c 1

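In plain Python I could write something like this:

from collections import defaultdict
from itertools import combinations

d = defaultdict(set)
for tag, id_ in zip(df["tag"], df["id"]):  # collect the set of ids seen for each tag
    d[tag].add(id_)
for tag1, tag2 in combinations(sorted(d), 2):  # every unordered pair of tags
    print(tag1, tag2, len(d[tag1] & d[tag2]))  # ids shared by both tags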

This is not the most efficient way, but it should work. The data is already stored in a pandas DataFrame. Is there a more pandas-like way to achieve the same result?

Upvotes: 2

Views: 683

Answers (1)

MaxU - stand with Ukraine

Reputation: 210842

Here is my attempt:

from itertools import combinations
import pandas as pd
import numpy as np

In [123]: df
Out[123]:
  tag id
0   a  A
1   b  B
2   a  B
3   b  A
4   c  A

In [124]: a = np.asarray(list(combinations(df.tag, 2)))

In [125]: a
Out[125]:
array([['a', 'b'],
       ['a', 'a'],
       ['a', 'b'],
       ['a', 'c'],
       ['b', 'a'],
       ['b', 'b'],
       ['b', 'c'],
       ['a', 'b'],
       ['a', 'c'],
       ['b', 'c']],
      dtype='<U1')

In [126]: a = a[a[:,0] != a[:,1]]

In [127]: a
Out[127]:
array([['a', 'b'],
       ['a', 'b'],
       ['a', 'c'],
       ['b', 'a'],
       ['b', 'c'],
       ['a', 'b'],
       ['a', 'c'],
       ['b', 'c']],
      dtype='<U1')

In [129]: a.sort(axis=1)  # sort each pair in place, so ['b', 'a'] becomes ['a', 'b']

In [130]: pd.DataFrame(a).groupby([0,1]).size()
Out[130]:
0  1
a  b    4
   c    2
b  c    2
dtype: int64
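
Note that the counts above (4, 2, 2) are twice those in the question's desired output (2, 1, 1), because combinations(df.tag, 2) pairs tags across all rows without looking at id. One possible way to pair tags only within the same id is a self-merge on id. This is a minimal sketch, not a tuned solution; the names pairs and result and the suffixes "1" and "2" are my own choices, and the sample frame is rebuilt from the question's data:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({"tag": list("ababc"), "id": list("ABBAA")})

# Self-join on id: every pair of tags that share an id becomes one row,
# with the two tag columns renamed to tag1 and tag2.
pairs = df.merge(df, on="id", suffixes=("1", "2"))

# Keep each unordered pair once; tag1 < tag2 also drops self-pairs.
pairs = pairs[pairs["tag1"] < pairs["tag2"]]

# Count the distinct ids shared by each pair of tags.
result = (
    pairs.groupby(["tag1", "tag2"])["id"]
    .nunique()
    .reset_index(name="count")
)
print(result)
```

On the sample data this gives a/b = 2, a/c = 1 and b/c = 1. Tag pairs that share no id do not appear in the result at all, and the self-merge can get large when a single id carries many tags.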

Upvotes: 2
