Reputation: 739
Given this DataFrame:
df = pd.DataFrame([[1,1],[2,2],[2,3],[2,3],[2,4]], columns = ['A','B'])
df
A B
0 1 1
1 2 2
2 2 3
3 2 3
4 2 4
I want to try different ways of collecting the values in B, using groupby on column A and apply on column B.
This works as expected if I collect B's as a list:
df.groupby('A')['B'].apply(list).reset_index(name='list')
A list
0 1 [1]
1 2 [2, 3, 3, 4]
This works as expected if I collect B's as a set:
df.groupby('A')['B'].apply(set).reset_index(name='set')
A set
0 1 {1}
1 2 {2, 3, 4}
I (naively) would have expected the Counter class to just work the same way:
from collections import Counter
Counter([2, 3, 3, 4])
Counter({2: 1, 3: 2, 4: 1})
But it behaves rather unexpectedly when trying to use Counter the same way I used set or list:
df.groupby('A')['B'].apply(Counter).reset_index(name='counter')
A level_1 counter
0 1 1 1.0
1 1 2 NaN
2 1 3 NaN
3 1 4 NaN
4 2 1 NaN
5 2 2 1.0
6 2 3 2.0
7 2 4 1.0
I was hoping for:
A counter
0 1 Counter({1: 1})
1 2 Counter({2: 1, 3: 2, 4: 1})
One interesting clue is this:
df.groupby('A')['B'].apply(type).reset_index(name='type')
A type
0 1 <class 'pandas.core.series.Series'>
1 2 <class 'pandas.core.series.Series'>
But this works as I would expect:
Counter(pd.core.series.Series([2, 3, 3, 4]))
Counter({2: 1, 3: 2, 4: 1})
And this doesn't work:
def mycounter(series):
    return Counter(list(series))
mycounter
df.groupby('A')['B'].apply(mycounter).reset_index(name='type')
A level_1 type
0 1 1 1.0
1 1 2 NaN
2 1 3 NaN
3 1 4 NaN
4 2 1 NaN
5 2 2 1.0
6 2 3 2.0
7 2 4 1.0
I kind of suspect a bug with Pandas?
(ADDED): I just tried this and it works, so I'm not sure why apply doesn't but agg does:
df.groupby('A')['B'].agg([Counter]).reset_index()
A Counter
0 1 {1: 1}
1 2 {2: 1, 3: 2, 4: 1}
Upvotes: 1
Views: 735
Reputation: 35636
Use groupby agg instead:
df.groupby('A')['B'].agg(Counter).reset_index(name='counter')
A counter
0 1 {1: 1}
1 2 {2: 1, 3: 2, 4: 1}
apply is an interesting function, as it can produce both aggregated and non-aggregated results.
Run:
df.groupby('A')['B'].apply(lambda x: {0: 1, 1: 2, 2: 3})
A
1 0 1
1 2
2 3
2 0 1
1 2
2 3
Name: B, dtype: int64
When a dict is returned from apply, pandas interprets its keys as index labels in the result rather than treating the dict as a single aggregated value (as agg does).
This is why the Counters (which are dict subclasses) are expanded like this:
A
1 1 1.0 # {1: 1} (index 1 value 1)
2 NaN
3 NaN
4 NaN
2 1 NaN
2 1.0 # {2: 1, 3: 2, 4: 1} (index 2 value 1)
3 2.0 # (index 3 value 2)
4 1.0 # (index 4 value 1)
Name: B, dtype: float64
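The NaN rows in the output above come from aligning each group's keys against the union of all keys. As an analogy only (not a claim about the exact pandas internals), building a DataFrame from one dict per group shows the same alignment; the names per_group and wide are illustrative.

```python
import pandas as pd

# One plain dict per group, mirroring the Counter contents from the question.
per_group = [{1: 1}, {2: 1, 3: 2, 4: 1}]

# A list of dicts becomes a table whose columns are the union of all keys;
# keys that a given dict lacks are filled with NaN.
wide = pd.DataFrame(per_group, index=pd.Index([1, 2], name="A"))
print(wide)
```

Reading that table row by row, one (A, key) pair at a time, gives the long result with NaN entries, and the NaN values also explain why the counts become floats.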
However, with agg a single value is expected to be returned per group, so the dict is treated as a single unit:
Run:
df.groupby('A')['B'].agg(lambda x: {0: 1, 1: 2, 2: 3})
A
1 {0: 1, 1: 2, 2: 3}
2 {0: 1, 1: 2, 2: 3}
Name: B, dtype: object
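One quick way to tell which interpretation was used is to look at the index of the result. This sketch assumes the same DataFrame as in the question and a recent pandas version; the helper name as_dict is hypothetical.

```python
import pandas as pd

df = pd.DataFrame([[1, 1], [2, 2], [2, 3], [2, 3], [2, 4]], columns=["A", "B"])


def as_dict(series):
    # Return the same dict for every group, as in the examples above.
    return {0: 1, 1: 2, 2: 3}


applied = df.groupby("A")["B"].apply(as_dict)
aggregated = df.groupby("A")["B"].agg(as_dict)

# apply adds an inner index level built from the dict keys;
# agg keeps one row per group with the dict stored as the value.
print(applied.index.nlevels, len(applied))
print(aggregated.index.nlevels, len(aggregated))
```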
Upvotes: 4