MetaStack

Reputation: 3696

pandas speed up multi-row calculation by using groupby

I'm trying to make a calculation on multiple rows for every row in a dataframe.

My current solution takes almost 2 hours for 200k rows, so it's pretty inefficient. I'm hoping groupby or some other pandas method can help me here.

My data looks like this, for example (you can ignore the dates for now):

id group start_date end_date   three_yrs_ago_date days_missing
01 23    2005-01-01 2006-01-01 2002-01-01           1
02 23    2006-01-06 2007-01-06 2003-01-06           6
03 23    2007-01-15 2008-01-15 2004-01-15           9
07 17    2014-01-01 2015-02-01 2011-01-01           2
07 23    2015-01-01 2016-02-01 2012-01-01           4
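
For reference, a frame like the sample above can be built as follows (a sketch: it assumes the date columns are parsed as datetimes and `id` is kept as a zero-padded string):

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['01', '02', '03', '07', '07'],
    'group': [23, 23, 23, 17, 23],
    'start_date': pd.to_datetime(
        ['2005-01-01', '2006-01-06', '2007-01-15', '2014-01-01', '2015-01-01']),
    'end_date': pd.to_datetime(
        ['2006-01-01', '2007-01-06', '2008-01-15', '2015-02-01', '2016-02-01']),
    'three_yrs_ago_date': pd.to_datetime(
        ['2002-01-01', '2003-01-06', '2004-01-15', '2011-01-01', '2012-01-01']),
    'days_missing': [1, 6, 9, 2, 4],
})
```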

So the goal here is to group everything by its group number and then, for every row, add up the days_missing of all the other rows in that group that occurred within the last 3 years. That is to say, the other row's start_date is at or after the current row's three_yrs_ago_date, and at or before the current row's end_date.

That's a mouthful, but it's basically three criteria. If this were the whole dataset, we'd get this result (dropping the date columns):

id group days_missing days_missing_in_last_three_years            
01 23    1            1    # no change: no prior years
02 23    6            7 
03 23    9            16  
07 17    2            2    # no change: only member of its group
07 23    4            4    # no change: other group members more than 3 years ago

I'll show you the code I currently have, but it's slow.

I go through the dataframe row by row, creating a temporary dataframe containing all of its group members, then I whittle those group members down to only the ones within the date criteria. It's not pretty:

days=[]
for index, row in tqdm(df.iterrows()):
    # moderately slow (~2 hour):
    temp = df[df['group'] == row['group']]
    temp = temp[temp['start_date'] >= row['three_yrs_ago_date']]
    temp = temp[temp['end_date'] <= row['start_date']]
    add = temp['days_missing'].sum() + row['days_missing']
    days.append(add)
df['days_missing_in_last_three_years'] = days

I tried two other approaches, but neither was successful:

# very slow (~3 hours):
cov.append(
    df[(df['group'] == row['group'])
       & (df['start_date'] >= row['three_yrs_ago_date'])
       & (df['end_date'] <= row['start_date'])]['days_missing'].sum()
    + row['days_missing']
)

# doesn't work - incorrect use of groupby
df['test'] = df[(df.groupby(['group'])['start_date'] >= df.groupby(['group'])['three_yrs_ago_date']) & (df.groupby(['group'])['end_date'] <= df.groupby(['group'])['start_date'])]['days_missing'].sum()

Is there a better way to do this more efficiently than breaking it up into smaller temporary dataframes and doing calculations on them?

Upvotes: 2

Views: 624

Answers (2)

Ben.T

Reputation: 29635

Here is one solution, probably faster than your way: use a for loop over df.groupby('group'), then apply on each grouped dataframe df_g. You can use the method between to select the part of df_g between both dates for each row:

for name, df_g in df.groupby('group'):
    # for each row, sum days_missing over group members whose start_date
    # falls between that row's three_yrs_ago_date and end_date
    df.loc[df_g.index, 'test'] = df_g.apply(
        lambda row: df_g['days_missing'][
            df_g['start_date'].between(row['three_yrs_ago_date'], row['end_date'])
        ].sum(),
        axis=1)
df['test'] = df['test'].astype(int)  # to get integers

The result is as expected:

   id  group start_date   end_date three_yrs_ago_date  days_missing  test
0   1     23 2005-01-01 2006-01-01         2002-01-01             1     1
1   2     23 2006-01-06 2007-01-06         2003-01-06             6     7
2   3     23 2007-01-15 2008-01-15         2004-01-15             9    16
3   7     17 2014-01-01 2015-02-01         2011-01-01             2     2
4   7     23 2015-01-01 2016-02-01         2012-01-01             4     4

EDIT: a faster way using numpy functions:

import numpy as np

for name, df_g in df.groupby('group'):
    # m_g[i, j] is True when row j's start_date lies in row i's window
    m_g = (np.less_equal.outer(df_g['three_yrs_ago_date'], df_g['start_date'])
           & np.greater_equal.outer(df_g['end_date'], df_g['start_date']))
    df.loc[df_g.index, 'test'] = np.dot(m_g, df_g['days_missing'])
df['test'] = df['test'].astype(int)  # to get integers
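
To see why the outer/dot trick works, here is a tiny standalone sketch with plain integers standing in for the dates (the names `lo`, `hi`, etc. are made up for illustration). The pattern is the same: a boolean membership matrix times the values vector gives each row its windowed sum.

```python
import numpy as np

# Row i should sum vals[j] for every row j whose start falls in [lo[i], hi[i]].
lo    = np.array([0, 1, 2])   # stand-in for three_yrs_ago_date
start = np.array([5, 6, 7])   # stand-in for start_date
hi    = np.array([5, 6, 7])   # stand-in for end_date
vals  = np.array([1, 6, 9])   # stand-in for days_missing

# mask[i, j] is True when lo[i] <= start[j] <= hi[i]
mask = np.less_equal.outer(lo, start) & np.greater_equal.outer(hi, start)

# matrix-vector product sums vals[j] over the columns each row i covers
windowed = mask @ vals        # array([ 1,  7, 16])
```

Note the diagonal of the mask is always True (each row's own start lies in its own window), which is why the row's own days_missing is included automatically.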

Upvotes: 1

Tomas Farias

Reputation: 1343

Here's an attempt using .groupby, .loc and .transform:

import numpy as np

conditions = (
    (df['start_date'] >= df['three_yrs_ago_date'])
    & (df['end_date'] <= df['start_date'])
)
df['test'] = np.nan  # initialize column, otherwise the next line raises a KeyError
df.loc[conditions, 'test'] = df.loc[conditions, ].groupby('group')['days_missing'].transform('sum')
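
A minimal illustration of the `.loc` mask + `.transform('sum')` pattern itself, on toy data unrelated to the question (column names here are invented for the example):

```python
import pandas as pd

toy = pd.DataFrame({'group': [1, 1, 2],
                    'x': [10, 20, 30],
                    'keep': [True, False, True]})
toy['total'] = float('nan')  # initialize column first
# only rows passing the mask receive their group's (masked) sum of x
toy.loc[toy['keep'], 'total'] = (
    toy.loc[toy['keep']].groupby('group')['x'].transform('sum')
)
```

Rows failing the mask keep NaN, which is why the column must exist before the `.loc` assignment.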

Upvotes: 1
