Reputation: 1816
I'm having some issues with data manipulation in pandas; it seems like maybe a pandas bug? Would love some ideas.
I've got an (index-sorted) dataframe my_df that looks something like this:
                                                             value1    value2
col0                                 col1 col2     col3 col4
0035ca76-b209-4c4e-9bba-b18459c4dceb 203  positive 173  148     0.0  0.086892
                                          negative 1156 148     0.0  0.090347
                                                   1157 148     0.0  0.090347
                                                   1158 148     0.0  0.090347
                                                   1159 148     0.0  0.084884
                                                   1160 148     0.0  0.079942
                                                   1161 148     0.0  0.079824
                                                   1162 148     0.0  0.071289
                                          positive 173  66      0.0  0.079831
                                          negative 1156 66      0.0  0.082660
                                                   1157 66      0.0  0.082660
                                                   1158 66      0.0  0.082660
                                                   1159 66      0.0  0.084353
                                                   1160 66      0.0  0.076934
                                                   1161 66      0.0  0.076494
                                                   1162 66      0.0  0.070424
00e35aaf-050a-4f09-bf94-df994e4bf681 24   positive 14   38      0.0  0.073936
                                          negative 134  38      0.0  0.075913
                                                   135  38      0.0  0.075913
                                                   136  38      0.0  0.074403
                                                   137  38      0.0  0.081120
                                                   138  38      0.0  0.078560
                                                   139  38      0.0  0.080680
                                                   140  38      0.0  0.073892
                                          positive 14   1       0.0  0.051979
                                          negative 134  1       0.0  0.043818
                                                   135  1       0.0  0.043818
                                                   136  1       0.0  0.049795
                                                   137  1       0.0  0.052171
                                                   138  1       0.0  0.048573
                                                   139  1       0.0  0.045205
                                                   140  1       0.0  0.054696
... more rows for this and other col0 + col1 combos
I'm trying to just compute the sum of "value2" for each unique combination of [col0, col1, col2, col3]. As far as I can tell, the most logical way to do this would be
my_df.groupby(level=list(range(4))).sum()
However, I'm getting really odd results that seem like a pandas bug.
grouped = my_df.groupby(level=list(range(4)))
for name, group in grouped:
    print(group)
    break
sums = grouped.sum()
Indeed, the first group is as I'd expect:
                                                             value1    value2
col0                                 col1 col2     col3 col4
0035ca76-b209-4c4e-9bba-b18459c4d681 199  positive 174  151     0.0  0.089186
                                                        158     0.0  0.104250
and the number of groups in grouped is correct (you'll have to take my word for that; I've verified it other ways), but sums is haywire and has a huge number of extraneous rows:
(Pdb) len(grouped)
334
(Pdb) len(sums)
53760
(Pdb) sums[:30]
                                                             value1  value2
col0                                 col1 col2     col3 col4
1f11aede-6aed-44ef-9296-004b6269662c 17   positive 7    1       0.0     0.0
                                                        4       0.0     0.0
                                                        5       0.0     0.0
                                                        6       0.0     0.0
                                                        7       0.0     0.0
                                                        8       0.0     0.0
                                                        11      0.0     0.0
                                                        12      0.0     0.0
                                                        24      0.0     0.0
                                                        32      0.0     0.0
                                                        33      0.0     0.0
                                                        38      0.0     0.0
                                                        39      0.0     0.0
                                                        53      0.0     0.0
                                                        56      0.0     0.0
                                                        66      0.0     0.0
                                                        69      0.0     0.0
                                                        70      0.0     0.0
                                                        72      0.0     0.0
                                                        73      0.0     0.0
                                                        75      0.0     0.0
                                                        85      0.0     0.0
                                                        91      0.0     0.0
                                                        94      0.0     0.0
                                                        116     0.0     0.0
                                                        119     0.0     0.0
The values of col4 vary widely throughout the dataframe. It looks like the groupby + aggregation op creates a sum row for every value of col4 in the entire dataframe, as opposed to just the values of col4 that actually pertain to each group. In other words, most of these rows don't even have entries in the original dataframe:
(Pdb) my_df.loc[("1f11aede-6aed-44ef-9296-004b6269662c", 17, "positive", 7, 1)]
*** KeyError: ('1f11aede-6aed-44ef-9296-004b6269662c', 17, 'positive', 7, 1)
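As a broader sanity check (a quick sketch, assuming the my_df and sums objects above), you can count how many rows of sums correspond to index entries that actually exist in the original dataframe:
# How many of the aggregated rows are present in my_df's index?
present = sums.index.isin(my_df.index)
print(present.sum(), "of", len(sums), "rows in sums exist in my_df")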
Any idea what's going on here? This seems totally off-script from what the groupby API and tutorial describe. For example, as far as I know, groupby => agg should create one row per group here.
Upvotes: 0
Views: 341
Reputation: 1189
Try the following statements to get the sum of the value2 column, grouping by the slice_id, col1, col2, and col3 columns:
my_df.groupby(['slice_id', 'col1', 'col2', 'col3']).agg('sum')
or
my_df.groupby(['slice_id', 'col1', 'col2', 'col3'])[['value2']].agg('sum')
Upvotes: 0
Reputation: 1816
TLDR: As of pandas 1.2.4, groupby has nonintuitive behavior when one of the index levels is Categorical. Fixed with groupby(..., observed=True).
Ok, this one took me a while. It turns out that if one of your index levels is Categorical, then groupby has, in my opinion, totally nonintuitive behavior.
# Does a sort of cartesian product of index values if any of the index levels is Categorical.
my_df.groupby(level=list(range(4)))

# Doesn't do the cartesian product, in line with the behavior for every other index type.
my_df.groupby(level=list(range(4)), observed=True)
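To see the effect in isolation, here's a minimal, self-contained sketch (toy data and names, not my actual dataframe) of a Categorical index level triggering the product and observed=True suppressing it:
import pandas as pd

cat_type = pd.api.types.CategoricalDtype(categories=["positive", "negative"], ordered=True)
toy = pd.DataFrame({
    "col0": ["a", "a", "b"],
    "col2": pd.Series(["positive", "negative", "positive"], dtype=cat_type),
    "value2": [1.0, 2.0, 3.0],
}).set_index(["col0", "col2"])

# Default: one row per (col0, category) combination, including ("b", "negative"),
# which never occurs in the data.
print(len(toy.groupby(level=[0, 1]).sum()))                 # 4
# observed=True: only combinations that actually occur in the index.
print(len(toy.groupby(level=[0, 1], observed=True).sum()))  # 3
With observed=True, the original groupby above gives one row per group, as expected.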
In my case, col2 was Categorical. That is, at some point in my code, I had:
col2_type = pd.api.types.CategoricalDtype(categories=["positive", "negative"], ordered=True)
col2 = ["positive", "negative", "negative", "negative"]  # ...
# Make col2 categorical
my_df['col2'] = my_df.assign(col2=col2)['col2'].astype(col2_type)
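If you're not sure whether any of your index levels ended up categorical, a quick check (a sketch, assuming my_df as above) is to print each level's dtype; categorical levels show up as 'category':
for i, name in enumerate(my_df.index.names):
    print(name, my_df.index.get_level_values(i).dtype)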
A bit of mea culpa on this one: this behavior is documented (see the observed parameter of DataFrame.groupby in the pandas docs). That said, I'm not the only one who has been thoroughly confused by this.
With any luck, this "issue" will be fixed by changing the default of observed in some future release of pandas (as of this writing, the PR is unmerged).
Upvotes: 1