Optimized way of iterating through a dataframe

Question

I have a pandas dataframe called Visits2 that contains 20M records. Here is a sample of records from Visits2.

num         srv_edt     inpt_flag
000423733A  8/15/2016   N
001013135D  7/11/2016   N
001013135D  7/11/2016   N
001047851M  4/29/2016   N
001067291M  2/29/2016   Y
001067291M  8/3/2016    N
001067291M  8/3/2016    N
001067291M  9/4/2016    N
001070817A  5/25/2016   N
001070817A  5/25/2016   Y
001072424A  1/13/2016   N
001072424A  2/17/2016   Y
001072424A  3/21/2016   N
001072424A  3/21/2016   N
001072424A  5/10/2016   N
001072424A  6/6/2016    N

I'm running the code below. It assigns 'N' to inpt_any when srv_edt is the first occurrence within the group of num. If inpt_flag already has the value 'Y', it assigns 'Y' to inpt_any.

This runs fine, but at 20M rows it takes hours. Can someone please suggest an optimized way of looping through the dataframe?

prev_srv_edt = " "
for vv in Visits2.itertuples():
    inpt_any = 'N'
    if (prev_srv_edt != vv[1]):
        prev_srv_edt = vv[1]
        Visits2.loc[vv[0],'inpt_any'] = 'N'
    if (vv[2] == 'Y'):
        Visits2.loc[vv[0],'inpt_any'] = 'Y'

I also tried list(zip(visit['srv_edt'],visit['inpt_flag'])), but zip takes a long time to run as well.
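For reference, here is a minimal sketch of how the same assignments might be expressed with boolean masks instead of a Python-level loop. It assumes the loop's literal behavior: each row's srv_edt is compared with the previous row's value (regardless of num), 'Y' overrides 'N', and rows that repeat the previous date with inpt_flag 'N' are left empty. The names add_inpt_any and sample are hypothetical and used only for illustration.

```python
import pandas as pd


def add_inpt_any(visits: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: mask-based version of the row-by-row loop."""
    out = visits.copy()

    # True where srv_edt differs from the previous row's srv_edt.
    # The first row compares against a missing value, so it is always True.
    changed = out["srv_edt"].ne(out["srv_edt"].shift())

    # True where the row is already flagged as inpatient.
    is_inpatient = out["inpt_flag"].eq("Y")

    # Start empty: rows matched by neither mask stay unset, as in the loop.
    out["inpt_any"] = None
    out.loc[changed, "inpt_any"] = "N"
    out.loc[is_inpatient, "inpt_any"] = "Y"  # 'Y' wins over 'N'
    return out


# A few rows taken from the sample data in the question.
sample = pd.DataFrame(
    {
        "num": ["001067291M"] * 4,
        "srv_edt": ["2/29/2016", "8/3/2016", "8/3/2016", "9/4/2016"],
        "inpt_flag": ["Y", "N", "N", "N"],
    }
)
result = add_inpt_any(sample)
print(result)
```

If the intent is really "first occurrence of srv_edt within each num", the changed mask could instead be built as ~out.duplicated(["num", "srv_edt"]); that is an assumption about the intended logic, not something the loop above does.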

