Find most common pairs across multiple columns

Question

I have a pandas dataframe that looks like this:

Client   1_act      2_act      3_act      4_act      5_act      6_act   ...
1        hiking     swimming   skating    jumping    climbing   eating
2        eating     hiking     climbing   exploring  
3        hiking     exercising 
4        hiking     screaming  yelling    hopping    swimming  
...

Every row only has unique activities and can have MANY columns all named #_act (new columns can also be added at any time as a client reports a new activity). Every row has at least a pair (no one has less than 2 activities per row). New activity values can also be added at any time.

I am trying to find a way to return the most common pair of activity. So desired output would be like:

Pair                       Qty
(hiking, swimming)         2
(hiking, skating)          1
(hiking, jumping)          1
(hiking, climbing)         2
(hiking, eating)           2
(swimming, skating)        1
(swimming, jumping)        1
(swimming, climbing)       1
(swimming, eating)         1
(skating, jumping)         1
(skating, climbing)        1
(skating, eating)          1
(jumping, climbing)        1
(climbing, eating)         2
(eating, exploring)        1
(hiking, exercising)       1
(hiking, screaming)        1
(hiking, yelling)          1
(hiking, hopping)          1

...

The above output is an example output of all possible pairs in each row across all their columns in this sample dataset. If a pair repeated in subsequent rows it should add to the quantity, and if a new pair occurs in subsequent rows, it should be added to the column of pairs as a new row.

The purpose of this is to see what pair of activities are the most common among all clients. Any help would be greatly appreciated! Thank you!!

jezrael · Accepted Answer

Use combinations in list comprehension for flattening, count tuples by Counter and pass to DataFrame constructor:

from collections import Counter
from  itertools import combinations

df = df.set_index('Client')

c = Counter([y for x in df.values for y in combinations(x, 2)])
df = pd.DataFrame({'Pair': list(c.keys()), 'Qty': list(c.values())})

For top combinations:

n = 10
L = Counter([y for x in df.values for y in combinations(x, 2)]).most_common(n)

df = pd.DataFrame(L, columns=['Pair', 'Qty'])
print (df)
                   Pair  Qty
0    (hiking, swimming)    2
1    (hiking, climbing)    2
2      (hiking, eating)    2
3    (swimming, eating)    2
4     (hiking, hopping)    2
5     (hiking, skating)    1
6     (hiking, jumping)    1
7   (swimming, skating)    1
8   (swimming, jumping)    1
9  (swimming, climbing)    1

Find most common pairs across multiple columns

Answers (1)

Related Questions