Reputation: 3635
I'm trying to use Pandas to modify a huge dataset with the results of a groupby operation. What I need is to find the minimum value of groups of data, while ignoring zeros, but return this same zero in case it is the only value in the group.
Consider this dummy data:
>>> a=pd.DataFrame(dict(item_id=[1,1,1,2,2,2], pos_id=[3,7,7,7,7,7], target='T1 T2 T3 T1 T2 T3'.split(), val=[8,0,0,41,0,55]))
>>> a
item_id pos_id target val
0 1 3 T1 8
1 1 7 T2 0
2 1 7 T3 0
3 2 7 T1 41
4 2 7 T2 0
5 2 7 T3 55
For each item_id in pos_id == 7 and target == 'T2', I'd like to get the smallest non zero val of the whole group (ignoring target), and substitute it in place with this!
So, I'd like to get this in the end:
item_id pos_id target val
0 1 3 T1 8 <-- this row has the wrong pos_id and is ignored
1 1 7 T2 0 <-- this one maintains zero (all group has zeros)
2 1 7 T3 0
3 2 7 T1 41
4 2 7 T2 41 <-- this one gets the smallest of item_id group
5 2 7 T3 55
I have tried this:
>>> a.ix[(a.pos_id == 7) & (a.target == 'T2'), 'val'] = a.ix[a.pos_id == 7].groupby('item_id').val.min().values
But it doesn't work because the zeros aren't ignored. And I can't simply ignore them, because the Series in both sides of the assignment will have different sizes!
>>> a.ix[(a.pos_id == 7) & (a.target == 'T2'), 'val'].size, a.ix[(a.pos_id == 7) & (a.val != 0)].groupby('item_id').val.min().values.size
(2, 1)
I have also tried to put a high value in those zeros, so the first will almost succeed:
>>> a.ix[(a.pos_id == 7) & (a.val == 0), 'val'] = 9999
>>> a
item_id pos_id target val
0 1 3 T1 8
1 1 7 T2 9999
2 1 7 T3 9999
3 2 7 T1 41
4 2 7 T2 9999
5 2 7 T3 55
But:
>>> a.ix[(a.pos_id == 7) & (a.target == 'T2'), 'val'] = a.ix[a.pos_id == 7].groupby('item_id').val.min().values
>>> a
item_id pos_id target val
0 1 3 T1 8
1 1 7 T2 9999 <-- this one should maintain zero...
2 1 7 T3 9999
3 2 7 T1 41
4 2 7 T2 41 <-- this one works!
5 2 7 T3 55
But now I'd have to return to zero the values that shouldn't be modified... Ohhh, it has to be a better way!
Please, how do I do this set value with groupby ignoring some values, in one simpler step? And please also consider that performance is important, since the dataset has a few gigabytes...
THANK YOU!
Upvotes: 1
Views: 347
Reputation: 863206
I think you can use condition with numpy.where
for checking if all
values are 0
, then output is 0
else is min
without 0
values with transform
instead apply
:
a=pd.DataFrame(dict(item_id=[1,1,1,2,2,2],
pos_id=[3,7,7,7,7,7],
target='T1 T2 T3 T1 T2 T3'.split(),
val=[8,0,0,41,0,55]))
print a
item_id pos_id target val
0 1 3 T1 8
1 1 7 T2 0
2 1 7 T3 0
3 2 7 T1 41
4 2 7 T2 0
5 2 7 T3 55
a.ix[(a.pos_id == 7) & (a.target == 'T2'), 'val'] =
a.ix[a.pos_id == 7].groupby('item_id').val.transform(lambda x: np.where((x == 0).all(),
0,
x[x!=0].min()))
print a
item_id pos_id target val
0 1 3 T1 8
1 1 7 T2 0
2 1 7 T3 0
3 2 7 T1 41
4 2 7 T2 41
5 2 7 T3 55
Upvotes: 1