Reputation: 53
I have a dataset with two values per person like the below and want to generate all combinations and counts of the combinations. I have a working solution but it's hardcoded and not scalable, I am looking for ideas on how to improve my solution.
Example:
d = {'person': [1,1,2,2,3,3,4,4,5,5,6,6], 'type': ['a','b','a','c','c','b','d','a','b','c','b','d']}
df = pd.DataFrame(data=d)
df
person type
0 1 a
1 1 b
2 2 a
3 2 c
4 3 c
5 3 b
6 4 d
7 4 a
8 5 b
9 5 c
10 6 b
11 6 d
My Inefficient Solution:
df = pd.get_dummies(df)
typecols = [col for col in df.columns if 'type' in col]
df = df.groupby(['person'], as_index=False)[typecols].apply(lambda x: x.astype(int).sum())
df["a_b"] = df["type_a"] + df["type_b"]
df["a_c"] = df["type_a"] + df["type_c"]
df["a_d"] = df["type_a"] + df["type_d"]
df["b_c"] = df["type_b"] + df["type_c"]
df["b_d"] = df["type_b"] + df["type_d"]
df["c_d"] = df["type_c"] + df["type_d"]
df["a_b"] = df.apply(lambda x: 1 if x["a_b"] == 2 else 0, axis=1)
df["a_c"] = df.apply(lambda x: 1 if x["a_c"] == 2 else 0, axis=1)
df["a_d"] = df.apply(lambda x: 1 if x["a_d"] == 2 else 0, axis=1)
df["b_c"] = df.apply(lambda x: 1 if x["b_c"] == 2 else 0, axis=1)
df["b_d"] = df.apply(lambda x: 1 if x["b_d"] == 2 else 0, axis=1)
df["c_d"] = df.apply(lambda x: 1 if x["c_d"] == 2 else 0, axis=1)
df_sums = df[['a_b','a_c','a_d','b_c','b_d','c_d']].sum()
print(df_sums.to_markdown(tablefmt="grid"))
+-----+-----+
| | 0 |
+=====+=====+
| a_b | 1 |
+-----+-----+
| a_c | 1 |
+-----+-----+
| a_d | 1 |
+-----+-----+
| b_c | 2 |
+-----+-----+
| b_d | 1 |
+-----+-----+
| c_d | 0 |
+-----+-----+
This solution works because every person has exactly two distinct values from a list of six distinct values but would quickly become unmanageable if there were NULLS or more than six distinct.
Upvotes: 0
Views: 103
Reputation: 30920
We can do:
s = df.sort_values('type').groupby('person', sort=False)['type']\
.agg(tuple).value_counts()
s.index = [f'{x}_{y}' for x, y in s.index]
s = s.sort_index()
print(s)
a_b 1
a_c 1
a_d 1
b_c 2
b_d 1
Name: type, dtype: int64
get all the combinations is also simple:
from itertools import combinations
s = df.sort_values('type').groupby('person', sort=False)['type']\
.agg(tuple).value_counts()\
.reindex(list(combinations(df['type'].unique(), 2)), fill_value=0)
(a, b) 1
(a, c) 1
(a, d) 1
(b, c) 2
(b, d) 1
(c, d) 0
Name: type, dtype: int64
Upvotes: 1
Reputation: 59519
You can do a self-merge within person with a query to de-duplicate the matches (this is why we create the N
column). Then we sort the types so we only get one of 'a_b' (and not also 'b_a'), create the labels, and take the value_counts
. Using combinations
we can get the list of all possibilities to reindex with.
import numpy as np
from itertools import combinations
ids = ['_'.join(x) for x in combinations(df['type'].unique(), 2)]
#['a_b', 'a_c', 'a_d', 'b_c', 'b_d', 'c_d']
df['N'] = range(len(df))
df1 = df.merge(df, on='person').query('N_x > N_y')
df1[['type_x', 'type_y']] = np.sort(df1[['type_x', 'type_y']].to_numpy(), 1)
df1['label'] = df1['type_x'].str.cat(df1['type_y'], sep='_')
df1['label'].value_counts().reindex(ids, fill_value=0)
a_b 1
a_c 1
a_d 1
b_c 2
b_d 1
c_d 0
Name: label, dtype: int64
Upvotes: 0