Thomas
Thomas

Reputation: 21

How to apply multiple functions to a pandas dataframe without multiple loops?

I use pandas for an ETL process. I query a database I put the result in a dataframe; the dataframe is pretty big (1M rows * 50 columns). The dataframe is mainly composed of string and date.

I use the apply() function of pandas to make all my transformations. The thing is my transformation contains multiple branching on string.

df['merged_contract_status'] = df.apply(lib.merged_contract_status, axis=1) 
df['renewed'] = df.apply(lib.contract_renewed, axis=1)
df['renewal_dt'] = df.apply(lib.contract_renewed_date, axis=1)
....

I have a bunch of transformation like that. The functions I call:

def merged_contract_status(row):
    if row['tbm_contract_status'] is not np.nan:
        mergedContractStatus = row['tbm_contract_status']
    else:
        mergedContractStatus = row['ccn_status']

    return mergedContractStatus

def contract_renewed(row):
    ccn_activation_dt = row['ccn_activation_dt']
    ccn_status = row['ccn_status']
    f_init_val_dt = row['f_init_val_dt']
    f_nat_exp_dt = row['f_nat_exp_dt']
    spr_spo_code = row['spr_spo_code']  
    csp_status_sep_1 = row['csp_status_sep_1']
    csp_begin_dt_sep_1 = row['csp_begin_dt_sep_1']
    ctt_type_1 = row['ctt_type_1']  
    csp_status_sep_2 = row['csp_status_sep_2']
    csp_begin_dt_sep_2 = row['csp_begin_dt_sep_2']
    ctt_type_2 = row['ctt_type_2']  
    csp_status_sep_3 = row['csp_status_sep_3']
    csp_begin_dt_sep_3 = row['csp_begin_dt_sep_3']
    ctt_type_3 = row['ctt_type_3']
    csp_begin_dt_sep_father = row['csp_begin_dt_sep_father']
    csp_end_dt_sep_father = row['csp_end_dt_sep_father']
    todayDate = datetime.datetime.today()       

    if spr_spo_code == 'PCC':       
        if ctt_type_1 == 'NORMAL' and ccn_activation_dt is not None and csp_begin_dt_sep_1 is not None and csp_begin_dt_sep_1>(ccn_activation_dt+timedelta(365)):
            return 'Y'
        elif ctt_type_2 == 'NORMAL' and ccn_activation_dt is not None and csp_begin_dt_sep_2 is not None and csp_begin_dt_sep_2>(ccn_activation_dt+timedelta(365)):
            return 'Y'
        elif ctt_type_3== 'NORMAL' and ccn_activation_dt is not None and csp_begin_dt_sep_3 is not None and csp_begin_dt_sep_3>(ccn_activation_dt+timedelta(365)):
            return 'Y'
        else:
            return 'N'
    else:
        if (f_init_val_dt is None and f_nat_exp_dt is None and 
            ccn_activation_dt is not None and 
            ccn_activation_dt < (todayDate- timedelta(365)) and 
            (csp_begin_dt_sep_father <= todayDate and csp_begin_dt_sep_father >= todayDate and ccn_status=='ACTIVATED')):
            return 'Y'
        elif f_init_val_dt is not None and f_nat_exp_dt is not None and f_init_val_dt <= todayDate and f_nat_exp_dt >= todayDate and ccn_status=='ACTIVATED' and ccn_activation_dt is not None and ccn_activation_dt < (todayDate- timedelta(365)):
            return 'Y'
        else:
            return 'N'

Each time I call apply on my df, pandas loops through the entire df and it takes around 10 min. I'm OK with that 10 min; I know I cannot improve performance. But is there a way to avoid multiple loop? Can pandas loop only once and do all the transformation I want?

EDIT: It's hard to give you data since the dataframe is big and it is built with sql query. The help I want is a way to loop only once through the dataframe, I do not want to improve each function (taht's kind of impossible with those branching on string)

Thanks

Upvotes: 2

Views: 2978

Answers (1)

firelynx
firelynx

Reputation: 32214

Python and Pandas can only do one thing at a time. Your functions does a lot of things though and you could build them together to one function. But that is not actually your problem here.

Applying functions with axis=1 is super heavy, because you are iterating over the whole dataframe.

You are using pandas, but you are not using pandas.

You should rewrite the functions you apply into vectorized operations.

It looks like the .loc method could help you out a lot

Index your dataframe on anything you are using in the if statements

df.set_index(['spr_spo_code', 'ctt_type_1', 'ccn_activation_dt'], inplace=True)

and so on.

Then you can use the .loc, but first you should create a column where you want the result.

df['Contract_renewed'] = 'N'
df.loc['PCC', 'NORMAL', None ..., 'Contract_renewed'] = 'Y'

And this will be crazy much faster than your current way of operation.

Tiny example

>>> df = pd.DataFrame({'foo':[1,2,3], 'bar':['baz','boo','bee'], 'baz':['N']*3})
>>> df.set_index(['bar', 'foo'], inplace=True)
>>> df.loc[('baz', 1), 'baz'] = 'Y'
>>> df
        baz
bar foo    
baz 1     Y
boo 2     N
bee 3     N

Upvotes: 1

Related Questions