Reputation: 1945
I have a dataset that contains only two columns, user_id and Channel. The Channel column can take values from a pre-defined list [a, b, c, d]. There are multiple rows with the same user_id, and each row can contain any of the above channels.
If I consider the unique channels that each user visited, which set occurs most frequently?
Example dataframe:
>>> df = pd.DataFrame([[1, 'a'], [1, 'b'], [1, 'b'], [1,'b'], [2,'c'], [2,'a'], [2,'a'], [2,'b'], [3,'a'], [3,'b']], columns=['user_id', 'Channel'])
>>> df
user_id Channel
0 1 a
1 1 b
2 1 b
3 1 b
4 2 c
5 2 a
6 2 a
7 2 b
8 3 a
9 3 b
Expected result for the above example would be something like:
user_id == 1: the set of unique Channels is {a, b}, and that counts once for that combination.
user_id == 2: the set of unique Channels is {a, b, c}, and that counts once for that combination. Note that this does not count towards any subsets of these unique Channels.
user_id == 3: the set of unique Channels is {a, b}, and that counts once for that combination.
If we count the one combination of unique Channels for each user_id, we should get:
>>> df_result = pd.DataFrame([['a,b', 2], ['a,b,c', 1]], columns=['Channels_together', 'n'])
>>> df_result
Channels_together n
0 a,b 2
1 a,b,c 1
I have come up with a solution, which is to pivot the table so that I get user_id as the index and columns a, b, c, d, then assign an integer to each Channel column where it is not NA, sum across columns, and convert the result back to each combination.
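In rough code, the idea is something like the sketch below (the power-of-two encoding is just one illustrative way to make the column sums map back to a unique combination):
import pandas as pd

df = pd.DataFrame([[1, 'a'], [1, 'b'], [1, 'b'], [1, 'b'], [2, 'c'], [2, 'a'],
                   [2, 'a'], [2, 'b'], [3, 'a'], [3, 'b']],
                  columns=['user_id', 'Channel'])

# Pivot: one row per user_id, one column per channel, NaN where never visited
pivot = df.pivot_table(index='user_id', columns='Channel', aggfunc='size')

# Assign a distinct power of two to each channel column that is not NA
codes = {ch: 2 ** i for i, ch in enumerate(pivot.columns)}
encoded = pivot.notna().mul(pd.Series(codes))

# Summing across columns gives one integer per user encoding its channel set
totals = encoded.sum(axis=1)

# Convert each integer back to the channel combination it encodes
def decode(total):
    return ','.join(ch for ch in pivot.columns if total & codes[ch])

result = (totals.map(decode)
                .value_counts()
                .rename_axis('Channels_together')
                .reset_index(name='n'))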
I'm sure there is a better way to do this but I can't seem to find out how.
Upvotes: 1
Views: 215
Reputation: 294218
frozenset is hashable and can be counted
df.groupby('user_id').Channel.apply(frozenset).value_counts()
(a, b) 2
(a, b, c) 1
Name: Channel, dtype: int64
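Frozensets with the same elements compare equal and hash equally, which is what lets value_counts group them (a quick illustrative check):
>>> hash(frozenset({'a', 'b'})) == hash(frozenset({'b', 'a'}))
True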
And we can tailor this to precisely what OP has with
c = df.groupby('user_id').Channel.apply(frozenset).value_counts()
pd.DataFrame({'Channels_together': c.index.str.join(', '), 'n': c.values})
Channels_together n
0 a, b 2
1 a, b, c 1
Alternatively
df.groupby('user_id').Channel.apply(frozenset).str.join(', ') \
.value_counts().rename_axis('Channels_together').reset_index(name='n')
Channels_together n
0 a, b 2
1 a, b, c 1
Upvotes: 2
Reputation: 42886
You can use groupby.apply(set) and then count the values with .value_counts:
df.groupby('user_id')['Channel'].apply(set).value_counts()\
.reset_index(name='n')\
.rename(columns={'index':'Channels_together'})
Output
Channels_together n
0 {a, b} 2
1 {a, c, b} 1
If you want your values in str format, we can write a lambda function to sort each set and convert it to a string:
df.groupby('user_id')['Channel'].apply(lambda x: ', '.join(sorted(set(x)))).value_counts()\
.reset_index(name='n')\
.rename(columns={'index':'Channels_together'})
Output
Channels_together n
0 a, b 2
1 a, b, c 1
Upvotes: 4