Reputation: 35
I have a dataframe
that I am trying to use pandas.groupby
on to get the sum. The values that I am grouping are as follows:
order_id otype score
0 id1 1 1.23
1 id1 2 1.56
2 id1 3 1.65
3 id1 4 1.73
4 id1 5 0.89
5 id2 1 1.07
6 id2 2 1.38
7 id2 3 0.94
8 id2 4 0.72
9 id2 5 1.37
10 id3 1 1.04
11 id3 2 0.56
12 id3 3 0.78
13 id3 4 1.12
14 id3 5 0.84
and I would like to perform a .groupby
operation on order_id
based on category values of otype
, I want to create groupings based on otype as follows:
otype in [1,2,3] --> class1
otype in [4,5] --> class2
otype in [1,2,3,4,5] --> class3
otype in [2,3,5] --> class4
I would like the (very) end result to be a table looking like this:
order_id otype score_sum
0 id1 class1 4.4
1 id1 class2 2.62
2 id1 class3 7.06
3 id1 class4 4.1
One way I can think of doing this is to calculate the result for each class separately and then use pd.concat
.But I don't think this is a computationally efficient way, nor is it a code-simple way.
Is there a better way?
Upvotes: 1
Views: 52
Reputation: 863176
I think the best is test both solutions in real data - but in my opinion pd.concat
solution is simplier and if few keys in dictionary should be faster:
d= {'class1': [1,2,3],
'class2': [4,5],
'class3': [1,2,3,4,5],
'class4': [2,3,5]}
df1 = pd.concat({k: df.loc[df['otype'].isin(v)]
.groupby('order_id').score.sum() for k, v in d.items()})
print (df1)
order_id
class1 id1 4.44
id2 3.39
id3 2.38
class2 id1 2.62
id2 2.09
id3 1.96
class3 id1 7.06
id2 5.48
id3 4.34
class4 id1 4.10
id2 3.69
id3 2.18
Name: score, dtype: float64
Alternative solution:
df1 = (pd.concat({k:df.loc[df['otype'].isin(v),['order_id','score']].assign(**{'class':k})
for k, v in d.items()})
.groupby(['class', 'order_id'])
.score.sum())
print (df1)
class order_id
class1 id1 4.44
id2 3.39
id3 2.38
class2 id1 2.62
id2 2.09
id3 1.96
class3 id1 7.06
id2 5.48
id3 4.34
class4 id1 4.10
id2 3.69
id3 2.18
Name: score, dtype: float64
for k, v in d.items():
df.loc[df['otype'].isin(v), k] = k
df = (df.drop('otype',1)
.melt(['order_id','score'])
.groupby(['value','order_id'])['score'].sum())
print (df)
value order_id
class1 id1 4.44
id2 3.39
id3 2.38
class2 id1 2.62
id2 2.09
id3 1.96
class3 id1 7.06
id2 5.48
id3 4.34
class4 id1 4.10
id2 3.69
id3 2.18
Name: score, dtype: float64
Upvotes: 1