Sanjay Manohar
Sanjay Manohar

Reputation: 7026

Pandas groupby with tag-style list

I have a dataset with 'tag-like' groupings:

     Id       tags
0    item1    ['friends','family']
1    item2    ['friends']
2    item3    []
3    item4    ['family','holiday']

So a row can belong to several groups. I want to create an object similar to groupby, so that I can use agg etc.

df.groupby('tags').count()

expected result

     tags          count
0    'friends'    2
1    'family'     2
2    'holiday'    1

But of course it won't work because it treats the whole list as the key, rather than the individual tags. Here's an attempt

tagset = set(df.tags.explode())
grpby  = { t: df.loc[df.tags.str.contains(t, regex=False)] 
           for t in tagset }

From what I understand, groupby objects are structured a bit like this. But how to make it a groupby object? So that I can do things like grpby.year.mean() etc?

Upvotes: 2

Views: 53

Answers (2)

mozway
mozway

Reputation: 260130

You can't have a row belong to multiple groups like your grpby object. Thus what you want to do is impossible in pure pandas, unless you duplicate the rows with explode, then you will be able to groupby.agg:

out = (df.explode('tags')
         .groupby('tags', as_index=False)
         .agg(**{'count': ('tags', 'size')})
      )

Output:

      tags  count
0   family      2
1  friends      2
2  holiday      1

With a more meaningful aggregation:

out = (df.explode('tags')
         .groupby('tags', as_index=False)
         .agg({'Id': frozenset})
      )

Output:

      tags              Id
0   family  (item4, item1)
1  friends  (item2, item1)
2  holiday         (item4)

Note however that explode is quite expensive, so if you just want to count the tags, better use pure python:

from collections import Counter
from itertools import chain

out = Counter(chain.from_iterable(df['tags']))

Output:

Counter({'friends': 2, 'family': 2, 'holiday': 1})

And if you want to split the DataFrame like your grpby object:

tmp = df.assign(group=df['tags']).explode('group')
group = tmp.pop('group')

out = dict(list(tmp.groupby(group)))

Output:

{'family':       Id               tags
           0  item1  [friends, family]
           3  item4  [family, holiday],
 'friends':       Id               tags
           0  item1  [friends, family]
           1  item2          [friends],
 'holiday':       Id               tags
           3  item4  [family, holiday]}

Upvotes: 1

PaulS
PaulS

Reputation: 25298

A possible solution, whose steps are:

  • The explode method is used to transform each item of a list-like to a row, replicating index values.

  • The value_counts function is then applied to count the occurrences of each unique value in the tags column.

NB: I thank @user19077881 for suggesting, in a comment below, to use reset_index() here. I also thank @ouroboros1 for letting me know that to_frame() is not needed.

df.explode('tags').value_counts('tags').reset_index()

Output:

      tags  count
0   family      2
1  friends      2
2  holiday      1

Upvotes: 1

Related Questions