Reputation: 13858
If this is a dupe please guide the way. I checked a few questions that came close but doesn't solve my issue.
I have a dummy DataFrame
as follows:
grp Ax Bx Ay By A_match B_match
0 foo 3 2 2 2 False True
1 foo 2 1 1 0 False False
2 foo 4 3 0 3 False True
3 foo 4 3 1 4 False False
4 foo 4 4 3 0 False False
5 bar 3 0 3 0 True True
6 bar 3 4 0 3 False False
7 bar 1 2 1 2 True True
8 bar 1 3 4 1 False False
9 bar 1 1 0 3 False False
My goal is to compare the A
s and B
s columns and summarize the result by grp
thus:
A_match B_match
False True False True
grp
bar 3 2 3 2
foo 5 0 3 2
So I added the two _match
columns as follows, to get the above df
:
df['A_match'] = df['Ax'].eq(df['Ay'])
df['B_match'] = df['Bx'].eq(df['By'])
Based on my understanding, I was hoping I could do something like this, but it doesn't work:
df.groupby('grp')[['A_match', 'B_match']].agg(pd.Series.value_counts)
# trunc'd Traceback:
# ... ValueError: no results ...
# ... During handling of the above exception, another exception occurred: ...
# ... ValueError: could not broadcast input array from shape (5,7) into shape (5)
In my actual data, I was able to sidestep this by forcing _match
es to be pd.Categorical
in a rather unsatisfactory manner. However, I've noted on and off success, and even with this dummy data I'm getting the exact error as above even using pd.Categorial
:
df['A_match'] = pd.Categorical(df['Ax'].eq(df['Ay']).values, categories=[True, False])
df['B_match'] = pd.Categorical(df['Bx'].eq(df['By']).values, categories=[True, False])
df.groupby('grp')[['A_match', 'B_match']].agg(pd.Series.value_counts)
# ... ValueError: could not broadcast input array from shape (5,7) into shape (5)
It makes no sense to me - where is shape (5, 7) even coming from? Each agg
would have passed a shape (5,)
last I checked. And even the agg
seems to be running differently than I imagined, it should be ran against the Series
:
>>> df.groupby('grp')[['A_match', 'B_match']].agg(lambda x: type(x))
A_match B_match
grp
bar <class 'pandas.core.series.Series'> <class 'pandas.core.series.Series'>
foo <class 'pandas.core.series.Series'> <class 'pandas.core.series.Series'>
# Good - it's Series, I should be able to call value_counts directly?
>>> df.groupby('grp')[['A_match', 'B_match']].agg(lambda x: x.value_counts())
# AttributeError: 'DataFrame' object has no attribute 'value_counts' <-- ?!?!? Where did 'DataFrame' come from?
I was eventually able to use the following combination, but still rather unsatisfactory as it introduces a lot of unnecessary axis
names.
>>> df.melt(id_vars='grp', value_vars=['A_match', 'B_match']).reset_index().pivot_table(index='grp', columns=['variable', 'value'], aggfunc=pd.Series.count)
index
variable A_match B_match
value False True False True
grp
bar 3 2 3 2
foo 5 0 3 2
Either method just seem rather contrived to achieve something that should be relatively common usage. I guess my question is, am I overlooking something obvious here?
Upvotes: 3
Views: 702
Reputation: 150735
You can agg
on dictionary:
(df.groupby('grp').agg({'A_match':'value_counts',
'B_match':'value_counts'})
.unstack(-1, fill_value=0)
)
Output:
A_match B_match
False True False True
bar 3.0 2.0 3 2
foo 5.0 NaN 3 2
Upvotes: 4