gmfredit

Reputation: 53

Grouped count of combinations in Pandas column

I have a dataset with two values per person like the below and want to generate all combinations and counts of the combinations. I have a working solution but it's hardcoded and not scalable, I am looking for ideas on how to improve my solution.

Example:

import pandas as pd

d = {'person': [1,1,2,2,3,3,4,4,5,5,6,6], 'type': ['a','b','a','c','c','b','d','a','b','c','b','d']}
df = pd.DataFrame(data=d)
df

    person type
0        1    a
1        1    b
2        2    a
3        2    c
4        3    c
5        3    b
6        4    d
7        4    a
8        5    b
9        5    c
10       6    b
11       6    d

My Inefficient Solution:

df = pd.get_dummies(df)
typecols = [col for col in df.columns if 'type' in col]
df = df.groupby(['person'], as_index=False)[typecols].apply(lambda x: x.astype(int).sum())

df["a_b"] = df["type_a"] + df["type_b"]
df["a_c"] = df["type_a"] + df["type_c"]
df["a_d"] = df["type_a"] + df["type_d"]
df["b_c"] = df["type_b"] + df["type_c"]
df["b_d"] = df["type_b"] + df["type_d"]
df["c_d"] = df["type_c"] + df["type_d"]

df["a_b"] = df.apply(lambda x: 1 if x["a_b"] == 2 else 0, axis=1)
df["a_c"] = df.apply(lambda x: 1 if x["a_c"] == 2 else 0, axis=1)
df["a_d"] = df.apply(lambda x: 1 if x["a_d"] == 2 else 0, axis=1)
df["b_c"] = df.apply(lambda x: 1 if x["b_c"] == 2 else 0, axis=1)
df["b_d"] = df.apply(lambda x: 1 if x["b_d"] == 2 else 0, axis=1)
df["c_d"] = df.apply(lambda x: 1 if x["c_d"] == 2 else 0, axis=1)

df_sums = df[['a_b','a_c','a_d','b_c','b_d','c_d']].sum()
print(df_sums.to_markdown(tablefmt="grid"))

+-----+-----+
|     |   0 |
+=====+=====+
| a_b |   1 |
+-----+-----+
| a_c |   1 |
+-----+-----+
| a_d |   1 |
+-----+-----+
| b_c |   2 |
+-----+-----+
| b_d |   1 |
+-----+-----+
| c_d |   0 |
+-----+-----+

This solution works because every person has exactly two distinct values drawn from a list of six, but it would quickly become unmanageable if there were NULLs or more than six distinct values.
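The hardcoded pair columns above can be avoided by enumerating pairs with itertools.combinations; a minimal sketch (plain Python counting with collections.Counter, not part of the original post):

```python
from itertools import combinations
from collections import Counter
import pandas as pd

d = {'person': [1,1,2,2,3,3,4,4,5,5,6,6],
     'type': ['a','b','a','c','c','b','d','a','b','c','b','d']}
df = pd.DataFrame(data=d)

# Count each sorted pair of types that co-occurs within a person
counts = Counter()
for _, types in df.groupby('person')['type']:
    counts.update(combinations(sorted(types), 2))

# Include zero counts for pairs that never co-occur
all_pairs = combinations(sorted(df['type'].unique()), 2)
result = {'_'.join(p): counts.get(p, 0) for p in all_pairs}
print(result)
# {'a_b': 1, 'a_c': 1, 'a_d': 1, 'b_c': 2, 'b_d': 1, 'c_d': 0}
```

Because the pairs are generated rather than typed out, this scales to any number of distinct types, and people with more (or fewer) than two rows simply contribute all their pairwise combinations.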

Upvotes: 0

Views: 103

Answers (2)

ansev

Reputation: 30920

We can do:

s = df.sort_values('type').groupby('person', sort=False)['type']\
    .agg(tuple).value_counts()
s.index = [f'{x}_{y}' for x, y in s.index]
s = s.sort_index()
print(s)

a_b    1
a_c    1
a_d    1
b_c    2
b_d    1
Name: type, dtype: int64

Getting all the combinations is also simple:

from itertools import combinations
# sorted(...) so the reindex keys match the sorted per-person tuples
s = df.sort_values('type').groupby('person', sort=False)['type']\
      .agg(tuple).value_counts()\
      .reindex(list(combinations(sorted(df['type'].unique()), 2)), fill_value=0)


(a, b)    1
(a, c)    1
(a, d)    1
(b, c)    2
(b, d)    1
(c, d)    0
Name: type, dtype: int64
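For reference, the two snippets can be combined into one self-contained version that also flattens the tuple index into the `a_b`-style labels (with `sorted(...)` added so the reindex keys always match the sorted per-person tuples):

```python
from itertools import combinations
import pandas as pd

d = {'person': [1,1,2,2,3,3,4,4,5,5,6,6],
     'type': ['a','b','a','c','c','b','d','a','b','c','b','d']}
df = pd.DataFrame(data=d)

# Sorted pair of types per person, counted, then reindexed so absent pairs show 0
s = (df.sort_values('type')
       .groupby('person', sort=False)['type']
       .agg(tuple)
       .value_counts()
       .reindex(list(combinations(sorted(df['type'].unique()), 2)), fill_value=0))
s.index = ['_'.join(pair) for pair in s.index]
print(s)
```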

Upvotes: 1

ALollz

Reputation: 59519

You can do a self-merge within person with a query to de-duplicate the matches (this is why we create the N column). Then we sort the types so we only get one of 'a_b' (and not also 'b_a'), create the labels, and take the value_counts. Using combinations we can get the list of all possibilities to reindex with.

import numpy as np
from itertools import combinations

# sorted(...) so the labels match the np.sort-ed type_x/type_y pairs below
ids = ['_'.join(x) for x in combinations(sorted(df['type'].unique()), 2)]
#['a_b', 'a_c', 'a_d', 'b_c', 'b_d', 'c_d']

df['N'] = range(len(df))
df1 = df.merge(df, on='person').query('N_x > N_y')

df1[['type_x', 'type_y']] = np.sort(df1[['type_x', 'type_y']].to_numpy(), 1)

df1['label'] = df1['type_x'].str.cat(df1['type_y'], sep='_')
df1['label'].value_counts().reindex(ids, fill_value=0)

a_b    1
a_c    1
a_d    1
b_c    2
b_d    1
c_d    0
Name: label, dtype: int64

Upvotes: 0
