Reputation: 133
data={'id':[1,1,1,1,2,2,2,2],
'date1':[datetime.date(2016,1,1),datetime.date(2016,1,2),datetime.date(2016,1,2),datetime.date(2016,1,4),
datetime.date(2016,1,2),datetime.date(2016,1,4),datetime.date(2016,1,3),datetime.date(2016,1,1)],
'date2':[datetime.date(2016,1,5),datetime.date(2016,1,3),datetime.date(2016,1,5),datetime.date(2016,1,5),
datetime.date(2016,1,4),datetime.date(2016,1,5),datetime.date(2016,1,4),datetime.date(2016,1,1)],
'score1':[5,7,3,2,9,3,8,3],
'score2':[1,3,0,5,2,20,7,7]}
df=pd.DataFrame.from_dict(data)
And dataframe df looks like this:
id date1 date2 score1 score2
0 1 2016-01-01 2016-01-05 5 1
1 1 2016-01-02 2016-01-03 7 3
2 1 2016-01-02 2016-01-05 3 0
3 1 2016-01-04 2016-01-05 2 5
4 2 2016-01-02 2016-01-04 9 2
5 2 2016-01-04 2016-01-05 3 20
6 2 2016-01-03 2016-01-04 8 7
7 2 2016-01-01 2016-01-01 3 7
Another dataframe UF that contains ID, usetdate
id usetdate
0 1 2016-01-01
1 1 2016-01-03
2 2 2016-01-04
3 2 2016-01-02
I want to find sum of score1 and score2 against each ID of UF if my date(usetdate) passed is between date1 and date2
id usetdate score1 score2
0 1 2016-01-01 5 1
1 1 2016-01-03 17 9
2 2 2016-01-04 20 29
3 2 2016-01-02 9 2
Upvotes: 2
Views: 525
Reputation: 765
df1.sql.set_alias('df1').join(df2.sql.set_alias('df2'),condition='df1.id=df2.id').filter("usetdate between date1 and date2").aggregate("df2.id,usetdate,sum(score1) score1,sum(score2) score2")
┌───────┬────────────┬────────┬────────┐
│ id │ usetdate │ score1 │ score2 │
│ int64 │ varchar │ int128 │ int128 │
├───────┼────────────┼────────┼────────┤
│ 1 │ 2016-01-03 │ 15 │ 4 │
│ 1 │ 2016-01-01 │ 5 │ 1 │
│ 2 │ 2016-01-04 │ 20 │ 29 │
│ 2 │ 2016-01-02 │ 9 │ 2 │
└───────┴────────────┴────────┴────────┘
Upvotes: 0
Reputation: 11
import datetime
import pandas as pd
data={'id':[1,1,1,1,2,2,2,2],
'date1':[datetime.date(2016,1,1),datetime.date(2016,1,2),datetime.date(2016,1,2),datetime.date(2016,1,4),
datetime.date(2016,1,2),datetime.date(2016,1,4),datetime.date(2016,1,3),datetime.date(2016,1,1)],
'date2':[datetime.date(2016,1,5),datetime.date(2016,1,3),datetime.date(2016,1,5),datetime.date(2016,1,5),
datetime.date(2016,1,4),datetime.date(2016,1,5),datetime.date(2016,1,4),datetime.date(2016,1,1)],
'score1':[5,7,3,2,9,3,8,3],
'score2':[1,3,0,5,2,20,7,7]}
df=pd.DataFrame.from_dict(data)
data={'id':[1,1,2,2],
'date':[datetime.date(2016,1,1),datetime.date(2016,1,2),datetime.date(2016,1,2),datetime.date(2016,1,4)]}
df1=pd.DataFrame.from_dict(data)
data1=[]
x=[]
def agg(df,df1):
for i in range(1,2):
x=list(df1.id)
y=list(df1.date)
data1= df[df.id==x[i]]
data2=data1[data1.date1 > y[i]]
data3=data2[data1.date2 < y[i]]
data4=data3.groupby(['id']).agg({"score1":sum})
x.append(data4)
return data4
agg(df,df1)
Please try this
Upvotes: 1
Reputation: 862511
You can create Series
with all datetimes first by date_range
, swap index with values in Series
and DataFrame.join
to original, last aggregate sum
:
s = pd.concat([pd.Series(r.Index,pd.date_range(r.date1, r.date2)) for r in df.itertuples()])
s = pd.Series(s.index, index=s, name='usetdate')
df = df.drop(['date1','date2'],axis=1).join(s).groupby(['id','usetdate'], as_index=False).sum()
print (df)
id usetdate score1 score2
0 1 2016-01-01 5 1
1 1 2016-01-02 15 4
2 1 2016-01-03 15 4
3 1 2016-01-04 10 6
4 1 2016-01-05 10 6
5 2 2016-01-01 3 7
6 2 2016-01-02 9 2
7 2 2016-01-03 17 9
8 2 2016-01-04 20 29
9 2 2016-01-05 3 20
EDIT:
L = [(i, d, s1, s2) for i, d1, d2, s1, s2 in df.values for d in pd.date_range(d1, d2)]
df = (pd.DataFrame(L, columns=['id','usetdate','score1','score2'])
.groupby(['id','usetdate'], as_index=False).sum())
print (df)
id usetdate score1 score2
0 1 2016-01-01 5 1
1 1 2016-01-02 15 4
2 1 2016-01-03 15 4
3 1 2016-01-04 10 6
4 1 2016-01-05 10 6
5 2 2016-01-01 3 7
6 2 2016-01-02 9 2
7 2 2016-01-03 17 9
8 2 2016-01-04 20 29
9 2 2016-01-05 3 20
EDIT:
You can merge
values with left join before aggregating:
df1['userdate'] = pd.to_datetime(df1['userdate'])
print (df1)
id userdate
0 1 2016-01-01
1 1 2016-01-03
2 2 2016-01-04
3 2 2016-01-02
L = [(i, d, s1, s2) for i, d1, d2, s1, s2 in df.values for d in pd.date_range(d1, d2)]
df = (pd.DataFrame(L, columns=['id','userdate','score1','score2'])
.merge(df1)
.groupby(['id','userdate'], as_index=False)
.sum())
print (df)
id userdate score1 score2
0 1 2016-01-01 5 1
1 1 2016-01-03 15 4
2 2 2016-01-02 9 2
3 2 2016-01-04 20 29
EDIT1:
You can filter values in list comprehension converted to tuples:
df1['userdate'] = pd.to_datetime(df1['userdate'])
print (df1)
id userdate
0 1 2016-01-01
1 1 2016-01-03
2 2 2016-01-04
3 2 2016-01-02
a = [tuple(x) for x in df1.values]
print (a)
[(1, Timestamp('2016-01-01 00:00:00')), (1, Timestamp('2016-01-03 00:00:00')),
(2, Timestamp('2016-01-04 00:00:00')), (2, Timestamp('2016-01-02 00:00:00'))]
L = [(i, d, s1, s2) for i, d1, d2, s1, s2 in df.values
for d in pd.date_range(d1, d2)
if (i, d) in a]
df = (pd.DataFrame(L, columns=['id','userdate','score1','score2'])
.groupby(['id','userdate'], as_index=False)
.sum())
print (df)
id userdate score1 score2
0 1 2016-01-01 5 1
1 1 2016-01-03 15 4
2 2 2016-01-02 9 2
3 2 2016-01-04 20 29
Upvotes: 1