Enes

Reputation: 189

Pandas apply and lambda function efficiency

I have the following data frame:

data = 
     a     b     c    val
0   'a1'  'b1'  'c1'   5
1   'a1'  'b1'  'c2'   10
2   'a1'  'b2'  'c3'   1 
3   'a2'  'b1'  'c4'   30
4   'a2'  'b1'  'c5'   20
5   'a2'  'b1'  'c6'   10

I want to create a new data frame with the same columns, where the value for a given (a, b, c) equals the sum of the values over all rows (aa, bb, cc) with a == aa, b == bb, and value(aa, bb, cc) >= value(a, b, c). Here (aa, bb, cc) ranges over the rows of the frame, i.e. aa is just another row's value of a, and so on.

The desired outcome is:

data_new = 
     a     b     c    val
0   'a1'  'b1'  'c1'   15
1   'a1'  'b1'  'c2'   10
2   'a1'  'b2'  'c3'   1 
3   'a2'  'b1'  'c4'   30
4   'a2'  'b1'  'c5'   50
5   'a2'  'b1'  'c6'   60
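
For reference, the example frame can be reconstructed like this (a minimal sketch, assuming pandas is imported as pd):

import pandas as pd

data = pd.DataFrame({
    'a':   ['a1', 'a1', 'a1', 'a2', 'a2', 'a2'],
    'b':   ['b1', 'b1', 'b2', 'b1', 'b1', 'b1'],
    'c':   ['c1', 'c2', 'c3', 'c4', 'c5', 'c6'],
    'val': [5, 10, 1, 30, 20, 10],
})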

And I do the following:

data['key'] = data['a'] + data['b']
data_new = data.copy()
data_new['val'] = data.apply(lambda row: data[(data.key == row.key)
                                              & (data.val >= row.val)].val.sum(), axis=1)
del data_new['key']

In my real data there are more key columns like a and b, which is why a chain of boolean comparisons such as

(data.a == row.a) & (data.b == row.b) & ...

is slower than the concatenated key. The problem is that even the current implementation is not as fast as I need, and the tables are huge. Is there any way to do the same thing faster and more efficiently?
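
As an aside, one way to generalize the mask to many key columns without building string keys is to compare the row against a list of columns; a sketch, where key_cols is a hypothetical placeholder for the real key columns (this keeps the row-by-row apply and its quadratic cost, it only avoids the concatenated key):

key_cols = ['a', 'b']  # placeholder: extend with the real key columns

def summed_at_least(row):
    # Rows that match this row on every key column...
    same_keys = (data[key_cols] == row[key_cols]).all(axis=1)
    # ...and whose val is at least this row's val.
    return data.loc[same_keys & (data['val'] >= row['val']), 'val'].sum()

data_new = data.copy()
data_new['val'] = data.apply(summed_at_least, axis=1)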

Upvotes: 0

Views: 2117

Answers (1)

DSM

Reputation: 353159

If I understand you, then you might be able to use cumsum with a bit of sorting:

>>> grouped = df.sort_values("val", ascending=False).groupby(["a", "b"])
>>> df["new_val"] = grouped["val"].cumsum()
>>> df
    a   b   c  val  new_val
0  a1  b1  c1    5       15
1  a1  b1  c2   10       10
2  a1  b2  c3    1        1
3  a2  b1  c4   30       30
4  a2  b1  c5   20       50
5  a2  b1  c6   10       60

Since you want to sum the values in a group that are >= the value of interest, you can sort the values in descending order within each (a, b) group and then take the cumulative sum: each row's running total is then exactly the sum of the values in its group that are at least as large as its own.
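
Put end to end for an arbitrary list of key columns, the idea might look like this (a sketch; key_cols is a placeholder for the real key columns, and the assignment relies on index alignment to restore the original row order):

key_cols = ['a', 'b']  # placeholder: extend with the real key columns

data_new = data.copy()
data_new['val'] = (
    data.sort_values('val', ascending=False)  # largest values first within each group
        .groupby(key_cols)['val']
        .cumsum()                             # running sum = sum of values >= this row's
)
# The assignment aligns on the index, so data_new keeps the original row order.

One caveat worth noting: if two rows in the same group share the same val, cumsum gives them different running totals, whereas the mask-based definition (val >= the row's val) would give both rows the same, larger sum.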

Upvotes: 2
