Reputation: 1328
I have a pandas dataframe, called Visits2 contains 20M records. Here are sample of records from Visits2.
num srv_edt inpt_flag
000423733A 8/15/2016 N
001013135D 7/11/2016 N
001013135D 7/11/2016 N
001047851M 4/29/2016 N
001067291M 2/29/2016 Y
001067291M 8/3/2016 N
001067291M 8/3/2016 N
001067291M 9/4/2016 N
001070817A 5/25/2016 N
001070817A 5/25/2016 Y
001072424A 1/13/2016 N
001072424A 2/17/2016 Y
001072424A 3/21/2016 N
001072424A 3/21/2016 N
001072424A 5/10/2016 N
001072424A 6/6/2016 N
I'm executing below code, Assign inpt_any with N
, when srv_edt is first occurrence in the group of num. if the inpt_flag already has the value as Y
then assign inpt_flag with Y
.
This is running fine, But consider at 20M volume, it is taking hours to run. Somebody, please suggest me optimize way of looping through the dataframe.
prev_srv_edt = " "
for vv in Visits2.itertuples():
inpt_any = 'N'
if (prev_srv_edt != vv[1]):
prev_srv_edt = vv[1]
Visits2.loc[vv[0],'inpt_any'] = 'N'
if (vv[2] == 'Y'):
Visits2.loc[vv[0],'inpt_any'] = 'Y'
I did try with list(zip(visit['srv_edt'],visit['inpt_flag']))
, but I see zip
also taking lot of time to run.
Upvotes: 0
Views: 48
Reputation: 210842
IIUC you can do it this way:
In [37]: df.loc[df.groupby('num')['srv_edt'].idxmin(), 'inpt_any'] = 'N'
In [38]: df
Out[38]:
num srv_edt inpt_flag inpt_any
0 000423733A 2016-08-15 N N
1 001013135D 2016-07-11 N N
2 001013135D 2016-07-11 N NaN
3 001047851M 2016-04-29 N N
4 001067291M 2016-02-29 Y N
5 001067291M 2016-08-03 N NaN
6 001067291M 2016-08-03 N NaN
7 001067291M 2016-09-04 N NaN
8 001070817A 2016-05-25 N N
9 001070817A 2016-05-25 Y NaN
10 001072424A 2016-01-13 N N
11 001072424A 2016-02-17 Y NaN
12 001072424A 2016-03-21 N NaN
13 001072424A 2016-03-21 N NaN
14 001072424A 2016-05-10 N NaN
15 001072424A 2016-06-06 N NaN
Upvotes: 1