Clem Wang

Reputation: 739

How do I use the Counter class with pandas groupby and apply

Given this DataFrame:

import pandas as pd
df = pd.DataFrame([[1, 1], [2, 2], [2, 3], [2, 3], [2, 4]], columns=['A', 'B'])
df
    A   B
0   1   1
1   2   2
2   2   3
3   2   3
4   2   4

I want to try different ways of aggregating the values in B, using groupby on column A and apply on column B.

This works as expected if I collect B's as a list:

df.groupby('A')['B'].apply(list).reset_index(name='list')
    A   list
0   1   [1]
1   2   [2, 3, 3, 4]

This works as expected if I collect B's as a set:

df.groupby('A')['B'].apply(set).reset_index(name='set')
    A   set
0   1   {1}
1   2   {2, 3, 4}

I (naively) would have expected the Counter class to just work the same way:

from collections import Counter
Counter([2, 3, 3, 4])
Counter({2: 1, 3: 2, 4: 1})

But it behaves rather unexpectedly when trying to use Counter the same way I used set or list:

df.groupby('A')['B'].apply(Counter).reset_index(name='counter')
    A   level_1 counter
0   1   1   1.0
1   1   2   NaN
2   1   3   NaN
3   1   4   NaN
4   2   1   NaN
5   2   2   1.0
6   2   3   2.0
7   2   4   1.0

I was hoping for:

    A   counter
0   1   Counter({1: 1})
1   2   Counter({2: 1, 3: 2, 4: 1})

One interesting clue is this:

df.groupby('A')['B'].apply(type).reset_index(name='type')
    A   type
0   1   <class 'pandas.core.series.Series'>
1   2   <class 'pandas.core.series.Series'>

But this works as I would expect:

Counter(pd.core.series.Series([2, 3, 3, 4]))
Counter({2: 1, 3: 2, 4: 1})

And this doesn't work:

def mycounter(series):
    # Convert to a plain list first, to rule out Series-specific behavior
    return Counter(list(series))

df.groupby('A')['B'].apply(mycounter).reset_index(name='type')
    A   level_1 type
0   1   1   1.0
1   1   2   NaN
2   1   3   NaN
3   1   4   NaN
4   2   1   NaN
5   2   2   1.0
6   2   3   2.0
7   2   4   1.0

Is this a bug in pandas?

(ADDED): I just tried agg instead, and it works, so I'm not sure why agg works but apply doesn't:

df.groupby('A')['B'].agg([Counter]).reset_index()
    A   Counter
0   1   {1: 1}
1   2   {2: 1, 3: 2, 4: 1}

Upvotes: 1

Views: 735

Answers (1)

Henry Ecker

Reputation: 35636

Use groupby agg instead:

df.groupby('A')['B'].agg(Counter).reset_index(name='counter')
   A             counter
0  1              {1: 1}
1  2  {2: 1, 3: 2, 4: 1}

apply is a flexible function: it can produce both aggregated and non-aggregated results, depending on what the applied function returns.

Run:

df.groupby('A')['B'].apply(lambda x: {0: 1, 1: 2, 2: 3})
A   
1  0    1
   1    2
   2    3
2  0    1
   1    2
   2    3
Name: B, dtype: int64

When a dict is returned from apply, its keys are interpreted as index labels in the result, rather than the dict being treated as a single aggregated value (as it is with agg).
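
Counter falls into this case because it is a subclass of dict. A minimal sketch to check this against the question's DataFrame (the exact layout of the expanded result may differ between pandas versions):

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame([[1, 1], [2, 2], [2, 3], [2, 3], [2, 4]], columns=['A', 'B'])

# Counter is a dict subclass, so apply handles it like a plain dict.
counter_is_dict = issubclass(Counter, dict)
print(counter_is_dict)

# list and set are not dicts, so apply keeps them as one value per group.
as_lists = df.groupby('A')['B'].apply(list)
print(as_lists)

# Returning a Counter makes apply spread its keys over an extra index level.
expanded = df.groupby('A')['B'].apply(Counter)
print(expanded)
```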

This is why the Counters are expanded like this:

A   
1  1    1.0  # {1: 1} (index 1 value 1)
   2    NaN
   3    NaN
   4    NaN
2  1    NaN
   2    1.0  # {2: 1, 3: 2, 4: 1} (index 2 value 1)
   3    2.0  # (index 3 value 2)
   4    1.0  # (index 4 value 1)
Name: B, dtype: float64
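
The NaN values and the float64 dtype can be reproduced outside of groupby. The sketch below is an assumption about the mechanism, inferred from the output above rather than taken from the pandas source: aligning one dict per group on the union of all keys leaves gaps wherever a group lacks a key.

```python
from collections import Counter

import pandas as pd

# One Counter per group, matching the question's data.
counters = [Counter([1]), Counter([2, 3, 3, 4])]

# Building a frame from dict-like rows aligns them on the union of keys.
# Keys missing from a group become NaN, which forces a float dtype.
frame = pd.DataFrame(counters, index=pd.Index([1, 2], name='A'))
print(frame)
```

Group 1 only has the key 1, so its entries for 2, 3 and 4 are NaN, matching the rows shown above.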

With agg, however, a single value is expected per group, so the dict is kept as a single unit:

Run:

df.groupby('A')['B'].agg(lambda x: {0: 1, 1: 2, 2: 3})
A
1    {0: 1, 1: 2, 2: 3}
2    {0: 1, 1: 2, 2: 3}
Name: B, dtype: object
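
Putting this together for the original question, here is a short sketch using the question's DataFrame. The variable names counts and by_group are only illustrative, and the dict comprehension is an alternative that sidesteps apply and agg entirely:

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame([[1, 1], [2, 2], [2, 3], [2, 3], [2, 4]], columns=['A', 'B'])

# agg expects one value per group, so each Counter is stored whole.
counts = df.groupby('A')['B'].agg(Counter)
print(counts.reset_index(name='counter'))

# Alternative: iterate over the groups and build the Counters directly.
by_group = {key: Counter(values) for key, values in df.groupby('A')['B']}
print(by_group)
```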

Upvotes: 4
