Reputation:

Combination aggregations in Pandas

I have data in this format

ID Val
1 A
1 B
1 C
2 A
2 C
2 D

I want to group by data at each ID and see combinations that exist and sum the multiple combinations up. The resulting output should look like

v1 v2 count
A B 1
A C 2
A D 1
B C 1
C D 1

Is there a smart way to get this instead of looping through each possible combinations?

Upvotes: 1

Answers (2)

behzad.nouri

Reputation: 78031

this should work:

>>> ts = df.groupby('Val')['ID'].aggregate(lambda ts: set(ts))
>>> ts
Val
A      set([1, 2])
B         set([1])
C      set([1, 2])
D         set([2])
Name: ID, dtype: object
>>> from itertools import product
>>> pd.DataFrame([[i, j, len(ts[i] & ts[j])] for i, j in product(ts.index, ts.index) if i < j], 
...              columns=['v1', 'v2', 'count'])
  v1 v2  count
0  A  B      1
1  A  C      2
2  A  D      1
3  B  C      1
4  B  D      0
5  C  D      1

Upvotes: 3

user41047

Reputation: 326

What I came up with:

Use pd.merge to create the cartesian product
Filter the cartesian product to include only combinations of the form that you desire
Count the number of combinations
Convert to the desired dataframe format

Unsure if it is faster than looping through all possible combinations.

#!/usr/bin/env python2.7
# encoding: utf-8
'''
'''
import pandas as pd
from itertools import izip

# Create the dataframe
df = pd.DataFrame([
    [1, 'A'],
    [1, 'B'],
    [1, 'C'],
    [2, 'A'],
    [2, 'C'],
    [2, 'D'],
], columns=['ID', 'Val'])
'''
   ID Val
0   1   A
1   1   B
2   1   C
3   2   A
4   2   C
5   2   D

[6 rows x 2 columns]
'''

# Create the cartesian product
df2 = pd.merge(df, df, on='ID')
'''
    ID Val_x Val_y
0    1     A     A
1    1     A     B
2    1     A     C
3    1     B     A
4    1     B     B
5    1     B     C
6    1     C     A
7    1     C     B
8    1     C     C
9    2     A     A
10   2     A     C
11   2     A     D
12   2     C     A
13   2     C     C
14   2     C     D
15   2     D     A
16   2     D     C
17   2     D     D

[18 rows x 3 columns]
'''

# Count the values, filtering A, A pairs, and B, A pairs.
counts = pd.Series([
    v for v in izip(df2.Val_x, df2.Val_y)
    if v[0] != v[1] and v[0] < v[1]
]).value_counts(sort=False).sort_index()
'''
(A, B)    1
(A, C)    2
(A, D)    1
(B, C)    1
(C, D)    1
dtype: int64
'''

# Combine the counts
df3 = pd.DataFrame(dict(
    v1=[v1 for v1, _ in counts.index],
    v2=[v2 for _, v2 in counts.index],
    count=counts.values
))
'''
   count v1 v2
0      1  A  B
1      2  A  C
2      1  A  D
3      1  B  C
4      1  C  D
'''

Upvotes: 2

Combination aggregations in Pandas

Answers (2)

Related Questions