rsalmei
rsalmei

Reputation: 3635

Pandas modify column with groupby results while ignoring some values

I'm trying to use Pandas to modify a huge dataset with the results of a groupby operation. What I need is to find the minimum value of groups of data, while ignoring zeros, but return this same zero in case it is the only value in the group.

Consider this dummy data:

>>> a=pd.DataFrame(dict(item_id=[1,1,1,2,2,2], pos_id=[3,7,7,7,7,7], target='T1 T2 T3 T1 T2 T3'.split(), val=[8,0,0,41,0,55]))
>>> a
   item_id  pos_id target  val
0        1       3     T1    8
1        1       7     T2    0
2        1       7     T3    0
3        2       7     T1   41
4        2       7     T2    0
5        2       7     T3   55

For each item_id in pos_id == 7 and target == 'T2', I'd like to get the smallest non zero val of the whole group (ignoring target), and substitute it in place with this!

So, I'd like to get this in the end:

   item_id  pos_id target  val
0        1       3     T1    8  <-- this row has the wrong pos_id and is ignored
1        1       7     T2    0  <-- this one maintains zero (all group has zeros)
2        1       7     T3    0
3        2       7     T1   41
4        2       7     T2   41  <-- this one gets the smallest of item_id group
5        2       7     T3   55

I have tried this:

>>> a.ix[(a.pos_id == 7) & (a.target == 'T2'), 'val'] = a.ix[a.pos_id == 7].groupby('item_id').val.min().values

But it doesn't work because the zeros aren't ignored. And I can't simply ignore them, because the Series in both sides of the assignment will have different sizes!

>>> a.ix[(a.pos_id == 7) & (a.target == 'T2'), 'val'].size, a.ix[(a.pos_id == 7) & (a.val != 0)].groupby('item_id').val.min().values.size
(2, 1)

I have also tried to put a high value in those zeros, so the first will almost succeed:

>>> a.ix[(a.pos_id == 7) & (a.val == 0), 'val'] = 9999
>>> a
   item_id  pos_id target   val
0        1       3     T1     8
1        1       7     T2  9999
2        1       7     T3  9999
3        2       7     T1    41
4        2       7     T2  9999
5        2       7     T3    55

But:

>>> a.ix[(a.pos_id == 7) & (a.target == 'T2'), 'val'] = a.ix[a.pos_id == 7].groupby('item_id').val.min().values
>>> a
   item_id  pos_id target   val
0        1       3     T1     8
1        1       7     T2  9999  <-- this one should maintain zero...
2        1       7     T3  9999
3        2       7     T1    41
4        2       7     T2    41  <-- this one works!
5        2       7     T3    55

But now I'd have to return to zero the values that shouldn't be modified... Ohhh, it has to be a better way!

Please, how do I do this set value with groupby ignoring some values, in one simpler step? And please also consider that performance is important, since the dataset has a few gigabytes...

THANK YOU!

Upvotes: 1

Views: 347

Answers (1)

jezrael
jezrael

Reputation: 863206

I think you can use condition with numpy.where for checking if all values are 0, then output is 0 else is min without 0 values with transform instead apply:

a=pd.DataFrame(dict(item_id=[1,1,1,2,2,2], 
                    pos_id=[3,7,7,7,7,7], 
                    target='T1 T2 T3 T1 T2 T3'.split(), 
                    val=[8,0,0,41,0,55]))
print a
   item_id  pos_id target  val
0        1       3     T1    8
1        1       7     T2    0
2        1       7     T3    0
3        2       7     T1   41
4        2       7     T2    0
5        2       7     T3   55

a.ix[(a.pos_id == 7) & (a.target == 'T2'), 'val'] = 
a.ix[a.pos_id == 7].groupby('item_id').val.transform(lambda x: np.where((x == 0).all(), 
                                                                        0, 
                                                                        x[x!=0].min()))
print a
   item_id  pos_id target  val
0        1       3     T1    8
1        1       7     T2    0
2        1       7     T3    0
3        2       7     T1   41
4        2       7     T2   41
5        2       7     T3   55

Upvotes: 1

Related Questions