Reputation: 1945
I have a dataset that contains only two columns, user_id and Channel. The Channel column can take values from a pre-defined list [a, b, c, d]. There are multiple rows with the same user_id, and each row can contain any of the above channels.
If I consider the unique channels that each user visited, which set occurs most frequently?
Example dataframe:
>>> df = pd.DataFrame([[1, 'a'], [1, 'b'], [1, 'b'], [1,'b'], [2,'c'], [2,'a'], [2,'a'], [2,'b'], [3,'a'], [3,'b']], columns=['user_id', 'Channel'])
>>> df
user_id Channel
0 1 a
1 1 b
2 1 b
3 1 b
4 2 c
5 2 a
6 2 a
7 2 b
8 3 a
9 3 b
Expected result for the above example would be something like:
user_id == 1: the set of unique Channels is {a, b}, and that counts once for that combination.
user_id == 2: the set of unique Channels is {a, b, c}, and that counts once for that combination. Note that this does not count towards any subsets of these unique Channels.
user_id == 3: the set of unique Channels is {a, b}, and that counts once for that combination.
If we count the one combination of unique Channels for each user_id, we should get:
>>> df_result = pd.DataFrame([['a,b', 2], ['a,b,c', 1]], columns=['Channels_together', 'n'])
>>> df_result
Channels_together n
0 a,b 2
1 a,b,c 1
I have come up with a solution, which is to pivot the table so that I get user_id as the index and columns a, b, c, d, then assign an integer to each Channel column where it is not NA, sum across columns, and convert the result back to each combination.
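In rough code, the idea is something like the sketch below (the power-of-two encoding is just one illustrative way to make the column sums map back to a unique combination):
import pandas as pd

df = pd.DataFrame([[1, 'a'], [1, 'b'], [1, 'b'], [1, 'b'], [2, 'c'], [2, 'a'],
                   [2, 'a'], [2, 'b'], [3, 'a'], [3, 'b']],
                  columns=['user_id', 'Channel'])

# Pivot: one row per user_id, one column per channel, NaN where never visited
pivot = df.pivot_table(index='user_id', columns='Channel', aggfunc='size')

# Assign a distinct power of two to each channel column that is not NA
codes = {ch: 2 ** i for i, ch in enumerate(pivot.columns)}
encoded = pivot.notna().mul(pd.Series(codes))

# Summing across columns gives one integer per user encoding its channel set
totals = encoded.sum(axis=1)

# Convert each integer back to the channel combination it encodes
def decode(total):
    return ','.join(ch for ch in pivot.columns if total & codes[ch])

result = (totals.map(decode)
                .value_counts()
                .rename_axis('Channels_together')
                .reset_index(name='n'))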
I'm sure there is a better way to do this but I can't seem to find out how.
Upvotes: 1
Views: 215
Reputation: 294218
frozenset is hashable and can be counted
df.groupby('user_id').Channel.apply(frozenset).value_counts()
(a, b) 2
(a, b, c) 1
Name: Channel, dtype: int64
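Frozensets with the same elements compare equal and hash equally, which is what lets value_counts group them (a quick illustrative check):
>>> hash(frozenset({'a', 'b'})) == hash(frozenset({'b', 'a'}))
True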
And we can tailor this to precisely what OP has with
c = df.groupby('user_id').Channel.apply(frozenset).value_counts()
pd.DataFrame({'Channels_together': c.index.str.join(', '), 'n': c.values})
Channels_together n
0 a, b 2
1 a, b, c 1
Alternatively
df.groupby('user_id').Channel.apply(frozenset).str.join(', ') \
.value_counts().rename_axis('Channels_together').reset_index(name='n')
Channels_together n
0 a, b 2
1 a, b, c 1
Upvotes: 2
Reputation: 42886
You can use groupby.apply(set) and then count the values with .value_counts:
df.groupby('user_id')['Channel'].apply(set).value_counts()\
.reset_index(name='n')\
.rename(columns={'index':'Channels_together'})
Output
Channels_together n
0 {a, b} 2
1 {a, c, b} 1
If you want your values in str format, we can write a lambda function to sort each set and convert it to a string:
df.groupby('user_id')['Channel'].apply(lambda x: ', '.join(sorted(set(x)))).value_counts()\
.reset_index(name='n')\
.rename(columns={'index':'Channels_together'})
Output
Channels_together n
0 a, b 2
1 a, b, c 1
Upvotes: 4