Grouped count of combinations in Pandas column

Question

I have a dataset with two values per person like the below and want to generate all combinations and counts of the combinations. I have a working solution but it's hardcoded and not scalable, I am looking for ideas on how to improve my solution.

Example:

d = {'person': [1,1,2,2,3,3,4,4,5,5,6,6], 'type': ['a','b','a','c','c','b','d','a','b','c','b','d']}
df = pd.DataFrame(data=d)
df

    person type
0        1    a
1        1    b
2        2    a
3        2    c
4        3    c
5        3    b
6        4    d
7        4    a
8        5    b
9        5    c
10       6    b
11       6    d

My Inefficient Solution:

df = pd.get_dummies(df)
typecols = [col for col in df.columns if 'type' in col]
df = df.groupby(['person'], as_index=False)[typecols].apply(lambda x: x.astype(int).sum())

df["a_b"] = df["type_a"] + df["type_b"]
df["a_c"] = df["type_a"] + df["type_c"]
df["a_d"] = df["type_a"] + df["type_d"]
df["b_c"] = df["type_b"] + df["type_c"]
df["b_d"] = df["type_b"] + df["type_d"]
df["c_d"] = df["type_c"] + df["type_d"]

df["a_b"] = df.apply(lambda x: 1 if x["a_b"] == 2 else 0, axis=1)
df["a_c"] = df.apply(lambda x: 1 if x["a_c"] == 2 else 0, axis=1)
df["a_d"] = df.apply(lambda x: 1 if x["a_d"] == 2 else 0, axis=1)
df["b_c"] = df.apply(lambda x: 1 if x["b_c"] == 2 else 0, axis=1)
df["b_d"] = df.apply(lambda x: 1 if x["b_d"] == 2 else 0, axis=1)
df["c_d"] = df.apply(lambda x: 1 if x["c_d"] == 2 else 0, axis=1)

df_sums = df[['a_b','a_c','a_d','b_c','b_d','c_d']].sum()
print(df_sums.to_markdown(tablefmt="grid"))

 +-----+-----+
|     |   0 |
+=====+=====+
| a_b |   1 |
+-----+-----+
| a_c |   1 |
+-----+-----+
| a_d |   1 |
+-----+-----+
| b_c |   2 |
+-----+-----+
| b_d |   1 |
+-----+-----+
| c_d |   0 |
+-----+-----+

This solution works because every person has exactly two distinct values from a list of six distinct values but would quickly become unmanageable if there were NULLS or more than six distinct.

ansev · Accepted Answer

We can do:

s = df.sort_values('type').groupby('person', sort=False)['type']\
    .agg(tuple).value_counts()
s.index = [f'{x}_{y}' for x, y in s.index]
s = s.sort_index()
print(s)

a_b    1
a_c    1
a_d    1
b_c    2
b_d    1
Name: type, dtype: int64

get all the combinations is also simple:

from itertools import combinations
s = df.sort_values('type').groupby('person', sort=False)['type']\
      .agg(tuple).value_counts()\
      .reindex(list(combinations(df['type'].unique(), 2)), fill_value=0)


(a, b)    1
(a, c)    1
(a, d)    1
(b, c)    2
(b, d)    1
(c, d)    0
Name: type, dtype: int64

Grouped count of combinations in Pandas column

Answers (2)

Related Questions