Reputation: 189
I have the following data frame
data =
      a     b     c  val
0  'a1'  'b1'  'c1'    5
1  'a1'  'b1'  'c2'   10
2  'a1'  'b2'  'c3'    1
3  'a2'  'b1'  'c4'   30
4  'a2'  'b1'  'c5'   20
5  'a2'  'b1'  'c6'   10
I want to create a new data frame with the same columns, where the value for a given (a, b, c) equals the sum of the values of all (aa, bb, cc) with a == aa, b == bb, and value(aa, bb, cc) >= value(a, b, c). Here aa is an alias for a, and so on.
The desired outcome is
data_new =
      a     b     c  val
0  'a1'  'b1'  'c1'   15
1  'a1'  'b1'  'c2'   10
2  'a1'  'b2'  'c3'    1
3  'a2'  'b1'  'c4'   30
4  'a2'  'b1'  'c5'   50
5  'a2'  'b1'  'c6'   60
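For reference, the sample frame can be built like this (assuming pandas is imported as pd):

import pandas as pd

data = pd.DataFrame({
    'a': ['a1', 'a1', 'a1', 'a2', 'a2', 'a2'],
    'b': ['b1', 'b1', 'b2', 'b1', 'b1', 'b1'],
    'c': ['c1', 'c2', 'c3', 'c4', 'c5', 'c6'],
    'val': [5, 10, 1, 30, 20, 10],
})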
Currently, I do the following:
# Build a composite key so rows can be matched on (a, b) with one comparison.
data['key'] = data['a'] + data['b']
data_new = data.copy()
# For each row, sum val over all rows sharing its key whose val is >= its own.
data_new['val'] = data.apply(
    lambda row: data[(data.key == row.key) & (data.val >= row.val)].val.sum(),
    axis=1)
del data_new['key']
del data['key']  # remove the helper column from the original frame too
Actually, in my real data I have more columns like a and b, which is why a chain of boolean comparisons such as
(data.a == row.a) & (data.b == row.b) & ...
is slower. The problem is that even the current implementation is not as fast as I would like, and I have huge tables. Is there a way to do the same thing faster and more efficiently?
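For what it's worth, grouping on a list of columns avoids building a concatenated key at all; here is a sketch of the same per-row logic expressed per group (still quadratic within each group, but it no longer scans the whole frame for every row):

# Group on the key columns directly instead of concatenating them;
# in the real data the list would include the extra key columns.
keys = ['a', 'b']
data_new = data.copy()
data_new['val'] = data.groupby(keys)['val'].transform(
    lambda s: s.apply(lambda v: s[s >= v].sum()))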
Upvotes: 0
Views: 2117
Reputation: 353159
If I understand you correctly, you might be able to use cumsum with a bit of sorting:
>>> grouped = df.sort_values("val", ascending=False).groupby(["a", "b"])
>>> df["new_val"] = grouped["val"].cumsum()
>>> df
    a   b   c  val  new_val
0  a1  b1  c1    5       15
1  a1  b1  c2   10       10
2  a1  b2  c3    1        1
3  a2  b1  c4   30       30
4  a2  b1  c5   20       50
5  a2  b1  c6   10       60
Since you want to sum the values in a group that are >= the value of interest, you can put the values in descending order within each (a, b) group and then take the cumulative sum. The result keeps the original row index, so assigning it back to df lines each sum up with the row it was computed for.
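One caveat: if a group contains tied val values, the cumulative sum gives each tied row a different total, whereas the >= definition in the question gives them the same total (the sample data has no ties, so the outputs match). Here is a sketch of a tie-robust variant of the same sort-and-cumsum idea, continuing from the df above:

# Sort so the cumulative sum runs from the largest value down.
tmp = df.sort_values("val", ascending=False)
cs = tmp.groupby(["a", "b"])["val"].cumsum()
# Within each (a, b, val) set of tied rows, give every row the full
# running total, i.e. the largest cumsum among the ties.
df["new_val"] = cs.groupby([tmp["a"], tmp["b"], tmp["val"]]).transform("max")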
Upvotes: 2