user2517372
user2517372

Reputation:

Combination aggregations in Pandas

I have data in this format

ID Val
1 A
1 B
1 C
2 A
2 C
2 D

I want to group by data at each ID and see combinations that exist and sum the multiple combinations up. The resulting output should look like

v1 v2 count
A B 1
A C 2
A D 1
B C 1
C D 1

Is there a smart way to get this instead of looping through each possible combinations?

Upvotes: 1

Views: 927

Answers (2)

behzad.nouri
behzad.nouri

Reputation: 78031

this should work:

>>> ts = df.groupby('Val')['ID'].aggregate(lambda ts: set(ts))
>>> ts
Val
A      set([1, 2])
B         set([1])
C      set([1, 2])
D         set([2])
Name: ID, dtype: object
>>> from itertools import product
>>> pd.DataFrame([[i, j, len(ts[i] & ts[j])] for i, j in product(ts.index, ts.index) if i < j], 
...              columns=['v1', 'v2', 'count'])
  v1 v2  count
0  A  B      1
1  A  C      2
2  A  D      1
3  B  C      1
4  B  D      0
5  C  D      1

Upvotes: 3

user41047
user41047

Reputation: 326

What I came up with:

  • Use pd.merge to create the cartesian product
  • Filter the cartesian product to include only combinations of the form that you desire
  • Count the number of combinations
  • Convert to the desired dataframe format

Unsure if it is faster than looping through all possible combinations.

#!/usr/bin/env python2.7
# encoding: utf-8
'''
'''
import pandas as pd
from itertools import izip

# Create the dataframe
df = pd.DataFrame([
    [1, 'A'],
    [1, 'B'],
    [1, 'C'],
    [2, 'A'],
    [2, 'C'],
    [2, 'D'],
], columns=['ID', 'Val'])
'''
   ID Val
0   1   A
1   1   B
2   1   C
3   2   A
4   2   C
5   2   D

[6 rows x 2 columns]
'''

# Create the cartesian product
df2 = pd.merge(df, df, on='ID')
'''
    ID Val_x Val_y
0    1     A     A
1    1     A     B
2    1     A     C
3    1     B     A
4    1     B     B
5    1     B     C
6    1     C     A
7    1     C     B
8    1     C     C
9    2     A     A
10   2     A     C
11   2     A     D
12   2     C     A
13   2     C     C
14   2     C     D
15   2     D     A
16   2     D     C
17   2     D     D

[18 rows x 3 columns]
'''

# Count the values, filtering A, A pairs, and B, A pairs.
counts = pd.Series([
    v for v in izip(df2.Val_x, df2.Val_y)
    if v[0] != v[1] and v[0] < v[1]
]).value_counts(sort=False).sort_index()
'''
(A, B)    1
(A, C)    2
(A, D)    1
(B, C)    1
(C, D)    1
dtype: int64
'''

# Combine the counts
df3 = pd.DataFrame(dict(
    v1=[v1 for v1, _ in counts.index],
    v2=[v2 for _, v2 in counts.index],
    count=counts.values
))
'''
   count v1 v2
0      1  A  B
1      2  A  C
2      1  A  D
3      1  B  C
4      1  C  D
'''

Upvotes: 2

Related Questions