Reputation: 3696
I'm trying to make a calculation on multiple rows for every row in a dataframe.
My current solution takes almost 2 hours for 200k rows, so it's pretty inefficient. I'm hoping groupby or some other pandas method can help me here.
My data looks like this, for example (you can ignore the dates for now):
id group start_date end_date three_yrs_ago_date days_missing
01 23 2005-01-01 2006-01-01 2002-01-01 1
02 23 2006-01-06 2007-01-06 2003-01-06 6
03 23 2007-01-15 2008-01-15 2004-01-15 9
07 17 2014-01-01 2015-02-01 2011-01-01 2
07 23 2015-01-01 2016-02-01 2012-01-01 4
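For reference, that sample frame can be built like this (assuming the date columns are real datetimes, not strings):

```python
import pandas as pd

# the example data from the table above
df = pd.DataFrame({
    'id': ['01', '02', '03', '07', '07'],
    'group': [23, 23, 23, 17, 23],
    'start_date': pd.to_datetime(
        ['2005-01-01', '2006-01-06', '2007-01-15', '2014-01-01', '2015-01-01']),
    'end_date': pd.to_datetime(
        ['2006-01-01', '2007-01-06', '2008-01-15', '2015-02-01', '2016-02-01']),
    'three_yrs_ago_date': pd.to_datetime(
        ['2002-01-01', '2003-01-06', '2004-01-15', '2011-01-01', '2012-01-01']),
    'days_missing': [1, 6, 9, 2, 4],
})
```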
So the goal here is to group everything by its group number and then, for each row, add up the days_missing of all the other rows in that group that occurred within the last 3 years. That is to say, the other row's start_date is at or after the current row's three_yrs_ago_date, and at or before the current row's end_date.
That's a mouthful, but it's basically three criteria. So if this were the whole dataset we'd get this result (dropping the date columns):
id group days_missing days_missing_in_last_three_years
01 23 1 1 # no change: no prior years
02 23 6 7
03 23 9 16
07 17 2 2 # no change: only member of its group
07 23 4 4 # no change: other group members more than 3 years ago
I'll show you the code I currently have, but it's slow. I go through the dataframe row by row, creating a temporary dataframe containing all of its group members, then I whittle those group members down to only the ones within the date criteria. It's not pretty:
from tqdm import tqdm

days = []
for index, row in tqdm(df.iterrows()):
    # moderately slow (~2 hours):
    temp = df[df['group'] == row['group']]
    temp = temp[temp['start_date'] >= row['three_yrs_ago_date']]
    temp = temp[temp['end_date'] <= row['start_date']]
    add = temp['days_missing'].sum() + row['days_missing']
    days.append(add)
df['days_missing_in_last_three_years'] = days
I tried 2 other approaches but neither was successful:
# very slow (~3 hours):
cov.append(df[(df['group'] == row['group'])
              & (df['start_date'] >= row['three_yrs_ago_date'])
              & (df['end_date'] <= row['start_date'])]['days_missing'].sum()
           + row['days_missing'])

# doesn't work - incorrect use of groupby
df['test'] = df[(df.groupby(['group'])['start_date'] >= df.groupby(['group'])['three_yrs_ago_date'])
                & (df.groupby(['group'])['end_date'] <= df.groupby(['group'])['start_date'])]['days_missing'].sum()
Is there a better way to do this more efficiently than breaking it up into smaller temporary dataframes and doing calculations on them?
Upvotes: 2
Views: 624
Reputation: 29635
Here is one solution, probably faster than your way: use a for loop over df.groupby('group'), then apply on each grouped dataframe df_g. You can use the method between to select the part of df_g between both dates for each row:
for name, df_g in df.groupby('group'):
    df.loc[df_g.index, 'test'] = df_g.apply(
        lambda row: df_g['days_missing'][df_g['start_date']
            .between(row['three_yrs_ago_date'], row['end_date'])].sum(), axis=1)
df['test'] = df['test'].astype(int)  # to get integers
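Note that between is inclusive on both endpoints by default, which matches the "at or after / at or before" criteria in the question; a quick check with made-up dates:

```python
import pandas as pd

# between() includes both bounds by default (inclusive="both")
s = pd.Series(pd.to_datetime(["2005-01-01", "2006-01-06", "2007-01-15"]))
mask = s.between(pd.Timestamp("2005-01-01"), pd.Timestamp("2006-12-31"))
print(mask.tolist())  # → [True, True, False]
```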
the result is like expected:
id group start_date end_date three_yrs_ago_date days_missing test
0 1 23 2005-01-01 2006-01-01 2002-01-01 1 1
1 2 23 2006-01-06 2007-01-06 2003-01-06 6 7
2 3 23 2007-01-15 2008-01-15 2004-01-15 9 16
3 7 17 2014-01-01 2015-02-01 2011-01-01 2 2
4 7 23 2015-01-01 2016-02-01 2012-01-01 4 4
EDIT: a faster way using numpy's outer ufunc methods:
import numpy as np

for name, df_g in df.groupby('group'):
    m_g = (np.less_equal.outer(df_g['three_yrs_ago_date'], df_g['start_date'])
           & np.greater_equal.outer(df_g['end_date'], df_g['start_date']))
    df.loc[df_g.index, 'test'] = np.dot(m_g, df_g['days_missing'])
df['test'] = df['test'].astype(int)  # to get integers
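To see why this works: np.less_equal.outer builds a pairwise matrix m_g where m_g[i, j] is True when row j's start_date falls in row i's window, so the dot product with days_missing sums exactly the matching rows. A toy sketch with made-up integers standing in for dates (bigger = later; on recent pandas you may need .to_numpy(), since ufunc .outer is not supported directly on Series):

```python
import numpy as np

# toy stand-ins for one group's date columns, chosen to mirror group 23 above
three_yrs_ago = np.array([10, 10, 10])  # each row's lower bound
start = np.array([10, 20, 30])          # each row's start_date
end = np.array([10, 25, 35])            # each row's upper bound
days_missing = np.array([1, 6, 9])

# m[i, j] is True when row j's start falls inside row i's window
m = (np.less_equal.outer(three_yrs_ago, start)
     & np.greater_equal.outer(end, start))
print(m @ days_missing)  # → [ 1  7 16], each row's windowed sum
```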
Upvotes: 1
Reputation: 1343
Here's an attempt using .groupby, .loc and .transform:
import numpy as np
conditions = (
(df['start_date'] >= df['three_yrs_ago_date'])
& (df['end_date'] <= df['start_date'])
)
df['test'] = np.nan  # initialize column, otherwise the next line raises KeyError
df.loc[conditions, 'test'] = (
    df.loc[conditions].groupby('group')['days_missing'].transform('sum')
)
Upvotes: 1