Reputation: 941
I'm having a headache with performing a join on data frames through look up:
df1
,
Name ID1 StartDate1 EndDate1 StartDate2 EndDate2
ab 111 1/1/2017 3/1/2017 2/1/2017 4/1/2017
bc 112 2/1/2017 3/1/2017 2/1/2017 4/1/2017
cd 113 1/1/2017 2/1/2017 2/1/2017 4/1/2017
df 114 2/1/2017 4/1/2017 2/1/2017 4/1/2017
fe 115 4/1/2017 5/1/2017 2/1/2017 4/1/2017
df2
,
ID1 ID2 Date Id1_num id2_num
111 10 1/1/2017 10 11
111 10 2/1/2017 10
111 10 3/1/2017 13 13
111 10 4/1/2017 15 13
112 20 2/1/2017 11 19
112 20 3/1/2017 12
112 20 4/1/2017 19
113 20 1/1/2017 17 5
113 20 2/1/2017 17 14
114 30 2/1/2017 11 13
114 30 3/1/2017 10
114 30 4/1/2017 18 13
115 30 4/1/2017 7 5
115 30 5/1/2017 13 19
I want my result df3
,
Name ID1 ID2 StartDate1 EndDate1 StartDate2 EndDate2 ID1Date1Count ID1Date1Average ID1Date2Count ID1Date2Average ID2Date1Count ID2Date1Average ID2Date2Count ID2Date2Average
ab 111 10 1/1/2017 3/1/2017 2/1/2017 4/1/2017 2 11.5 2 14 3 11.33 3 12
bc 112 20 2/1/2017 3/1/2017 2/1/2017 4/1/2017 2 11.5 2 11.5 1 19 2 19
cd 113 20 1/1/2017 2/1/2017 2/1/2017 4/1/2017 2 17 1 17 1 14 1 14
df 114 30 2/1/2017 4/1/2017 2/1/2017 4/1/2017 3 13 3 13 2 13 2 13
fe 115 30 4/1/2017 5/1/2017 2/1/2017 4/1/2017 1 19 0 0 1 19 0 0
Where
ID1Date1Count
is the entries count from the date interval [StartDate1, EndDate1]
,
ID1Date2Count
is the entries count from the date interval [StartDate2, EndDate2]
ID2Date1Count
is the entries count from the date interval [StartDate1, EndDate1]
ID2Date2Count
is the entries count from the date interval [StartDate2, EndDate2]
Along with their average.
The extra condition here is that we do not include num
if num < 10
.
Thanks a bunch!
Upvotes: 1
Views: 75
Reputation: 323276
I broke down the steps, after we get the dd1
, we can merge
it back to your df1
, you can replace NaN
by using fillna(0)
Newdf=pd.wide_to_long(df1,stubnames=['StartDate','EndDate'],i=['Name','ID1'],j='nnumer')
Newdf.StartDate=pd.to_datetime(Newdf.StartDate)
Newdf.EndDate=pd.to_datetime(Newdf.EndDate)
Newdf['New']=Newdf[['StartDate','EndDate']].apply(lambda x : [pd.date_range(x['StartDate'], x['EndDate'],freq='MS').tolist()],1)['StartDate']
Newdf=Newdf.set_index(['StartDate','EndDate'],append=True).New.apply(pd.Series).stack()
Newdf=Newdf.to_frame('Date').reset_index()
df2.Date=pd.to_datetime(df2.Date)
dd=Newdf.merge(df2,on=['ID1','Date'])
dd[['Id1_num','id2_num']]=dd[['Id1_num','id2_num']].mask(dd[['Id1_num','id2_num']]<10)
dd1=dd.groupby(['ID1','nnumer','StartDate','EndDate'])['Id1_num','id2_num'].agg(['mean','count']).unstack(1).groupby(level='ID1').ffill().bfill().reset_index(['StartDate','EndDate'],drop=True).drop_duplicates()
dd1
Out[626]:
Id1_num id2_num
mean count mean count
nnumer 1 2 1 2 1 2 1 2
ID1
111 11.5 14.0 2.0 2.0 11.333333 12.0 3.0 3.0
112 11.5 11.5 2.0 2.0 19.000000 19.0 1.0 2.0
113 17.0 17.0 2.0 1.0 14.000000 14.0 1.0 1.0
114 13.0 13.0 3.0 3.0 13.000000 13.0 2.0 2.0
115 13.0 NaN 1.0 0.0 19.000000 NaN 1.0 0.0
Updated
dd1.columns=dd1.columns.map(''.join)
dd1
Out[650]:
Id1_nummean1 Id1_nummean2 Id1_numcount1 Id1_numcount2 id2_nummean1 \
ID1
111 11.5 14.0 2.0 2.0 11.333333
112 11.5 11.5 2.0 2.0 19.000000
113 17.0 17.0 2.0 1.0 14.000000
114 13.0 13.0 3.0 3.0 13.000000
115 13.0 NaN 1.0 0.0 19.000000
id2_nummean2 id2_numcount1 id2_numcount2
ID1
111 12.0 3.0 3.0
112 19.0 1.0 2.0
113 14.0 1.0 1.0
114 13.0 2.0 2.0
115 NaN 1.0 0.0
Upvotes: 1
Reputation: 1352
Not 100% clear on what you actually want (e.g. what is "num < 10", which num? Id1_num or id2_num?), but the below should get you close I believe.
first -- group df2 and format it into the date intervals:
df2_grouped = df2.groupby(['ID1', 'ID2']).\
apply(lambda x: pd.Series([x['Date'].min(), x['Date'].max(),
x['id2_num'].count()],
index=['StartDate2', 'EndDate2', 'ID1Date1Count']))
df2_grouped = df2_grouped.reset_index()
print(df2_grouped)
ID1 ID2 StartDate2 EndDate2 ID1Date1Count
0 111 10 1/1/2017 4/1/2017 3
1 112 20 2/1/2017 4/1/2017 1
2 113 20 1/1/2017 2/1/2017 2
3 114 30 2/1/2017 4/1/2017 2
4 115 30 4/1/2017 5/1/2017 2
Then -- merge it with your df1:
df1[['ID1', 'StartDate1', 'EndDate1']].merge(df2_grouped, on='ID1')
df1 = df1.sort_values(by=['StartDate2'])
print(df1)
ID1 StartDate1 EndDate1 ID2 StartDate2 EndDate2 ID1Date1Count
0 111 1/1/2017 3/1/2017 10 1/1/2017 4/1/2017 3
2 113 1/1/2017 2/1/2017 20 1/1/2017 2/1/2017 2
1 112 2/1/2017 3/1/2017 20 2/1/2017 4/1/2017 1
3 114 2/1/2017 4/1/2017 30 2/1/2017 4/1/2017 2
4 115 4/1/2017 5/1/2017 30 4/1/2017 5/1/2017 2
Again, the output doesn't exactly match your desired output, but its close.
Upvotes: 0