Better way to do calculations for multiple dataframes and return statement?

Question

My function takes three dataframes, filters each one between a different pair of dates, and builds a statement from the results.
As you can see, it repeats the same steps for every dataframe, and I would like to cut down on that repetition.
I believe a for-loop would help, but I'm unsure how to build the return statement as one short paragraph, the way I do now.

def stat_generator(df,date1,date2,df2,date3,date4,df4,date5,date6): 
    ##First Date Filter for First Dataframe, and calculations for first dataframe
    
    df['Announcement Date'] = pd.to_datetime(df['Announcement Date'])
    mask = ((df['Announcement Date'] >= date1) & (df['Announcement Date'] <= date2))
    df_new = df.loc[mask]
    total = len(df_new)
    better = df_new[(df_new['performance'] == 'better')]
    better_perc = round(((len(better)/total)*100),2)
    worse = df_new[(df_new['performance'] == 'worse')]
    worse_perc = round(((len(worse)/total)*100),2)
    statement1 = ("During the time period between {} and {}, {} % of the students performed better. "
                  "{} % of the students performed worse").format(date1, date2, better_perc, worse_perc)
    
    ##Second Date Filter for Second Dataframe, and calculations for second dataframe
    
    df2['Announcement Date'] = pd.to_datetime(df2['Announcement Date'])
    mask2 = ((df2['Announcement Date'] >= date3) & (df2['Announcement Date'] <= date4))
    df_new2 = df2.loc[mask2]
    total2 = len(df_new2)
    better2 = df_new2[(df_new2['performance'] == 'better')]
    better_perc2 = round(((len(better2)/total2)*100),2)
    worse2 = df_new2[(df_new2['performance'] == 'worse')]
    worse_perc2 = round(((len(worse2)/total2)*100),2)
    statement2 = ("During the time period between {} and {}, {} % of the students performed better. "
                  "{} % of the students performed worse").format(date3, date4, better_perc2, worse_perc2)
    
    ##Third Date Filter for Third Dataframe, and calculations for third dataframe
    
    df4['Announcement Date'] = pd.to_datetime(df4['Announcement Date'])
    mask3 = ((df4['Announcement Date'] >= date5) & (df4['Announcement Date'] <= date6))
    df_new3 = df4.loc[mask3]
    total3 = len(df_new3)
    better3 = df_new3[(df_new3['performance'] == 'better')]
    better_perc3 = round(((len(better3)/total3)*100),2)
    worse3 = df_new3[(df_new3['performance'] == 'worse')]
    worse_perc3 = round(((len(worse3)/total3)*100),2)
    statement3 = ("During the time period between {} and {}, {} % of the students performed better. "
                  "{} % of the students performed worse").format(date5, date6, better_perc3, worse_perc3)

    statement = " ".join([statement1, statement2, statement3])
    return statement

GhandiFloss · Accepted Answer

I would pass just three parameters to your function (df, date1 and date2) and then call it three times, once per dataframe.

def stat_generator(df,date1,date2):
    "..."
    return statement

Then pass in your data as a list of lists or something similar. For example:

data = [[df,date1,date2],[df2,date3,date4],[df4,date5,date6]]

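statements = []
for args in data:
    statements.append(stat_generator(*args))  # keep each returned statement
statement = " ".join(statements)  # one paragraph, as before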
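
To make the idea concrete, here is a minimal end-to-end sketch of how the single-dataframe function and the loop might fit together. It is not the asker's exact code: the parameter names `start` and `end`, the `sample` dataframe and its dates are made up for illustration, and the "no announcements" branch is an added assumption to avoid dividing by zero when a date range is empty. Unlike the original, this version does not overwrite the 'Announcement Date' column of the dataframe passed in.

```python
import pandas as pd


def stat_generator(df, start, end):
    """Summarise the share of 'better' and 'worse' rows between two dates."""
    dates = pd.to_datetime(df["Announcement Date"])
    # Series.between is inclusive on both ends by default
    in_range = df.loc[dates.between(pd.Timestamp(start), pd.Timestamp(end))]
    total = len(in_range)
    if total == 0:
        # Assumption: an empty range gets its own message instead of a ZeroDivisionError
        return "During the time period between {} and {}, no announcements were found.".format(start, end)
    # value_counts(normalize=True) gives each label's fraction of the filtered rows
    shares = in_range["performance"].value_counts(normalize=True)
    better_perc = round(float(shares.get("better", 0.0)) * 100, 2)
    worse_perc = round(float(shares.get("worse", 0.0)) * 100, 2)
    return (
        "During the time period between {} and {}, {} % of the students performed better. "
        "{} % of the students performed worse."
    ).format(start, end, better_perc, worse_perc)


# Hypothetical sample data, only here to make the sketch runnable
sample = pd.DataFrame({
    "Announcement Date": ["2021-01-05", "2021-01-20", "2021-02-10", "2021-03-01"],
    "performance": ["better", "worse", "better", "same"],
})

# One [dataframe, start, end] entry per call; the same dataframe is reused for brevity
data = [
    [sample, "2021-01-01", "2021-01-31"],
    [sample, "2021-02-01", "2021-02-28"],
    [sample, "2021-04-01", "2021-04-30"],
]

# Each call returns one sentence pair; joining them yields a single paragraph
statement = " ".join(stat_generator(*args) for args in data)
print(statement)
```

Because the function returns its statement rather than printing it, the caller decides how to combine the pieces: `" ".join(...)` gives one paragraph, while `"\n".join(...)` would put each period on its own line.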

