Reputation: 313
I want to populate the empty columns 'web' 'mob 'app' by summing for each of the relevant dates in df2
df1:
id start end web mob app
12345 2018-01-17 2018-01-20
12346 2018-01-19 2018-01-22
12347 2018-01-20 2018-01-23
12348 2018-01-20 2018-01-23
12349 2018-01-21 2018-01-24
df2:
id date web mob app
12345 2018-01-17 7 17 10
12345 2018-01-18 9 18 7
12345 2018-01-19 3 19 15
12345 2018-01-20 6 17 8
12345 2018-01-21 8 9 13
12345 2018-01-22 4 15 12
12345 2018-01-23 8 11 13
12345 2018-01-24 9 16 14
12346 2018-01-17 3 17 12
12346 2018-01-18 4 19 4
12346 2018-01-19 6 13 10
12346 2018-01-20 1 15 6
12346 2018-01-21 4 12 11
12346 2018-01-22 5 20 12
12346 2018-01-23 8 13 14
12346 2018-01-24 6 18 8
This for loop will populate the 'web' column:
column = []
for i in df1.index:
column.append(df2[(df2['date'] >= df1['start'].iloc[i])
& (df2['date'] <= df1['end'].iloc[i])
& (df2['id'] == df1['id'].iloc[i])].sum()['web'])
df1['web'] = column
I want to be able to populate all 3 columns with one for loop, rather than doing 3 separate loops.
I have a feeling that using something like appending this
.agg({'web':'sum', 'mob':'sum', 'app':'sum'})
to a 2 dimensional list could be the answer.
Also... is there a more efficient way to do this than using for loops? Maybe by using numpy.where? I'm finding that running multiple for loops over large data sets can be very very slow.
Upvotes: 4
Views: 3888
Reputation: 164623
This is one way, but it is not "pandonic". It assumes your date columns are already converted to datetime
. But use @Wen's vectorised solution.
def filtersum(row):
result = [(w, m, a) for i, w, m, a, d in \
zip(df2.id, df2.web, df2.mob, df2.app, df2.date) \
if i == row['id'] and (row['start'] <= d <= row['end'])]
return [sum(i) for i in (zip(*result))] if result else [0, 0, 0]
df1[['web', 'mob', 'app']] = df1.apply(filtersum, axis=1)
# id start end web mob app
# 0 12345 2018-01-17 2018-01-20 25 71 40
# 1 12346 2018-01-19 2018-01-22 16 60 39
# 2 12347 2018-01-20 2018-01-23 0 0 0
# 3 12348 2018-01-20 2018-01-23 0 0 0
# 4 12349 2018-01-21 2018-01-24 0 0 0
Upvotes: 0
Reputation: 323226
IIUC
s=df1.merge(df2,on='id',how='left')
output=s[(s.start<=s.date)&(s.end>=s.date)].groupby('id').sum()
output
Out[991]:
web mob app
id
12345 25.0 71.0 40.0
12346 16.0 60.0 39.0
Then we using merge
again
df1.merge(output.reset_index(),how='left').fillna(0)
Out[995]:
id start end web mob app
0 12345 2018-01-17 2018-01-20 25.0 71.0 40.0
1 12346 2018-01-19 2018-01-22 16.0 60.0 39.0
2 12347 2018-01-20 2018-01-23 0.0 0.0 0.0
3 12348 2018-01-20 2018-01-23 0.0 0.0 0.0
4 12349 2018-01-21 2018-01-24 0.0 0.0 0.0
Upvotes: 5