Reputation: 3419
I have the following dataframe; each row represents a sale transaction:
startDate INDEX_250 priceDeal
2013-05-02 00:00:00 9312.000 255000.000
2013-09-17 00:00:00 11121.000 368209.000
2013-10-09 00:00:00 11121.000 254000.000
2013-11-14 00:00:00 11121.000 520000.000
2013-11-22 00:00:00 11121.000 201000.000
2014-02-05 00:00:00 11121.000 260000.000
2014-02-28 00:00:00 11121.000 425000.000
2014-03-01 00:00:00 11121.000 315000.000
2014-03-11 00:00:00 9312.000 427000.000
2014-04-27 00:00:00 9312.000 138070.000
2014-06-20 00:00:00 9312.000 270000.000
2014-07-21 00:00:00 9312.000 282000.000
2014-07-31 00:00:00 9312.000 308806.350
2014-09-27 00:00:00 11121.000 170000.000
2014-10-05 00:00:00 11121.000 171658.220
2014-10-11 00:00:00 11121.000 292000.000
2014-10-13 00:00:00 11121.000 125000.000
2014-10-30 00:00:00 9312.000 95000.000
2014-11-18 00:00:00 9312.000 158942.280
2015-01-25 22:00:00 11121.000 238829.370
2015-03-11 00:00:00 11121.000 180695.960
2015-03-14 00:00:00 9312.000 320932.860
2015-03-21 00:00:00 11121.000 139872.000
2015-09-04 00:00:00 11121.000 140000.000
2015-09-09 00:00:00 9312.000 235000.000
I want to see how many sale transactions happened in the previous 30 days (nrTargets_gr_250_30) and in the previous 60 days (nrTargets_gr_250_60) for each INDEX_250 group, starting from the first sale transaction on 2013-09-17. The example covers only one INDEX_250 group, but there are other index numbers as well. The following output is desired:
startDate INDEX_250 nrTargets_gr_250_30 nrTargets_gr_250_60
2013-10-17 11121 2.000 2.000
2013-11-16 11121 1.000 3.000
2013-12-16 11121 1.000 2.000
2014-01-15 11121 0.000 1.000
2014-02-14 11121 1.000 1.000
2014-03-16 11121 2.000 3.000
2014-04-15 11121 0.000 2.000
2014-10-12 11121 3.000 3.000
2014-11-11 11121 1.000 4.000
2014-12-11 11121 0.000 1.000
2015-02-09 11121 1.000 1.000
2015-03-11 11121 0.000 1.000
2015-04-10 11121 2.000 2.000
2015-05-10 11121 0.000 2.000
2015-09-07 11121 1.000 1.000
2015-10-07 11121 0.000 1.000
2016-02-04 11121 1.000 1.000
2016-03-05 11121 0.000 1.000
2017-01-29 11121 1.000 1.000
Upvotes: 2
Views: 119
Reputation: 31011
I noticed that you want startDate changed to the end of the respective 30-day period, counted from your start date.
Another detail is that you want the result grouped by INDEX_250: how many previous transactions (in both respective periods) had the same value of INDEX_250.
Note also that rolling calculations can be performed on a window containing rows from a number of future periods, but you want the number of transactions from the previous 30 or 60 days, and rolling does not allow a negative number of periods.
This is why I took a different approach than "ordinary" rolling.
Start from auxiliary variables:
import pandas as pd

td30 = pd.Timedelta('30D')
# Note: on pandas >= 1.4 use inclusive='left' instead of the
# deprecated closed='left'.
dRng = pd.date_range(start='2013-09-17', end=df.startDate.max() + td30,
                     freq='30D', closed='left')
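To see what dRng.asof does here, a small sketch (with a shortened, hypothetical end date, and using the modern inclusive='left' spelling): asof returns the latest period start not after the given date, so adding 30 days yields the end of the period that transaction falls into.

```python
import pandas as pd

td30 = pd.Timedelta('30D')
# Shortened range for illustration only; inclusive='left' is the modern
# spelling of the deprecated closed='left' argument.
dRng = pd.date_range(start='2013-09-17', end='2014-01-15',
                     freq='30D', inclusive='left')

# asof returns the latest period start <= the given date, so adding
# 30 days yields the end of that transaction's 30-day period.
sale = pd.Timestamp('2013-10-09')
bucket_end = dRng.asof(sale) + td30
print(bucket_end)  # → 2013-10-17 00:00:00
```

This matches the first row of the desired output: the 2013-10-09 sale is counted in the period ending 2013-10-17.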
Then define the following function computing both targets:
def targets(grp):
    grp['Prd'] = grp.startDate.apply(lambda x: dRng.asof(x) + td30)
    grp.set_index('Prd', inplace=True)
    trg30 = grp.groupby(level=0).INDEX_250.count()\
        .rename('nrTargets_gr_250_30').reindex(dRng, fill_value=0)
    trg60 = trg30.rolling(2).sum().rename('nrTargets_gr_250_60')\
        .fillna(0, downcast='infer')
    trg30 = trg30[trg30 > 0]
    trg60 = trg60[trg60 > 0]
    return trg30.to_frame().join(trg60, how='outer')\
        .fillna(0, downcast='infer').rename_axis('startDate')
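Inside targets, the 60-day count is derived from the 30-day counts: since the dRng buckets are 30 days apart, summing two adjacent buckets with rolling(2).sum() gives the 60-day figure. A toy sketch using the first four counts from the desired output:

```python
import pandas as pd

# Toy illustration of the trg60 idea: the 60-day count is the sum of
# two consecutive 30-day counts (window of size 2).
idx = pd.date_range('2013-10-17', periods=4, freq='30D')
trg30 = pd.Series([2, 1, 1, 0], index=idx, name='nrTargets_gr_250_30')
trg60 = trg30.rolling(2).sum().fillna(0).astype(int)
print(trg60.tolist())  # → [0, 3, 2, 1]
```

The first window has no left neighbor and yields NaN, filled with 0 here; in the answer's code such zero rows are later reconciled through the outer join with trg30.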
Apply it and reset indices (just in this order, to have proper ordering of columns):
df2 = df[df.startDate >= '2013-09-17'].groupby('INDEX_250')\
    .apply(targets).reset_index(level=[0]).reset_index()
And the last thing is to change the type of INDEX_250 to int:
df2.INDEX_250 = df2.INDEX_250.astype(int)
The result for this INDEX_250 group is just as you specified, excluding the result rows from 2016 and 2017, whose source transactions were not included in your sample data.
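Putting the pieces together, a minimal self-contained sketch of the whole pipeline: it uses only the first five transactions of the 11121 group, replaces the older closed='left' spelling with inclusive='left', and calls targets directly (there is a single group here, so the groupby-apply step is not needed). It reproduces the first five rows of the desired output.

```python
import pandas as pd

# First five sample transactions for INDEX_250 == 11121.
df = pd.DataFrame({
    'startDate': pd.to_datetime(['2013-09-17', '2013-10-09', '2013-11-14',
                                 '2013-11-22', '2014-02-05']),
    'INDEX_250': [11121] * 5,
    'priceDeal': [368209.0, 254000.0, 520000.0, 201000.0, 260000.0],
})

td30 = pd.Timedelta('30D')
dRng = pd.date_range(start='2013-09-17', end=df.startDate.max() + td30,
                     freq='30D', inclusive='left')

def targets(grp):
    grp = grp.copy()
    # Map each transaction to the end of its 30-day period.
    grp['Prd'] = grp.startDate.apply(lambda x: dRng.asof(x) + td30)
    grp = grp.set_index('Prd')
    # 30-day counts per period, with empty periods filled with 0.
    trg30 = grp.groupby(level=0).INDEX_250.count()\
        .rename('nrTargets_gr_250_30').reindex(dRng, fill_value=0)
    # 60-day counts: sum of two adjacent 30-day buckets.
    trg60 = trg30.rolling(2).sum().rename('nrTargets_gr_250_60')\
        .fillna(0).astype(int)
    trg30 = trg30[trg30 > 0]
    trg60 = trg60[trg60 > 0]
    return trg30.to_frame().join(trg60, how='outer')\
        .fillna(0).astype(int).rename_axis('startDate')

df2 = targets(df)   # single INDEX_250 group in this sample
print(df2)
```

The printed frame has rows 2013-10-17 (2, 2), 2013-11-16 (1, 3), 2013-12-16 (1, 2), 2014-01-15 (0, 1) and 2014-02-14 (1, 1), matching the question's expected output for those dates.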
To extend the result by average prices for each "final" date and both targets, two changes are required.
First, define another function to "reformat" target DataFrame:
def trgReformat(trg):
    trg = trg[trg.nrTargets_gr_250 > 0].copy()
    trg['avgPrice'] = trg.sm / trg.nrTargets_gr_250
    return trg.drop(columns='sm')
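On a toy target frame (made-up numbers, column names as in the answer), trgReformat drops empty periods, turns the price sum into an average, and removes the helper column:

```python
import pandas as pd

# Toy target frame in the shape trgReformat expects:
# a count column and a price-sum column `sm`.
trg = pd.DataFrame({'nrTargets_gr_250': [2, 0, 1],
                    'sm': [500000.0, 0.0, 260000.0]})

def trgReformat(trg):
    trg = trg[trg.nrTargets_gr_250 > 0].copy()
    trg['avgPrice'] = trg.sm / trg.nrTargets_gr_250
    return trg.drop(columns='sm')

out = trgReformat(trg)
print(out)  # two rows: avgPrice 250000.0 and 260000.0
```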
Second, redefine the targets function as:
def targets(grp):
    grp['Prd'] = grp.startDate.apply(lambda x: dRng.asof(x) + td30)
    grp.set_index('Prd', inplace=True)
    trg30 = grp.groupby(level=0).agg(
        nrTargets_gr_250=('INDEX_250', 'count'), sm=('priceDeal', 'sum'))\
        .reindex(dRng, fill_value=0)
    trg60 = trg30.rolling(2).sum().fillna(0, downcast='infer')
    trg30 = trgReformat(trg30)
    trg60 = trgReformat(trg60)
    return trg30.join(trg60, how='outer', lsuffix='_30', rsuffix='_60')\
        .fillna(0, downcast='infer').rename_axis('startDate')
This function uses named aggregation to compute both the number of transactions and the sum of their prices (sm), rather than the average price directly.
The reason is that trg60 is computed with rolling (summing two consecutive 30-day periods), so an average alone would not be sufficient here: averaging per-period averages gives a wrong result when the periods hold different numbers of transactions.
Computation of the average price can therefore be deferred until the reformatting of each target.
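A minimal named-aggregation sketch (toy data) that also shows why sums are carried instead of averages:

```python
import pandas as pd

# Toy data: two periods, one with two sales, one with a single sale.
df = pd.DataFrame({'Prd': ['a', 'a', 'b'],
                   'INDEX_250': [11121, 11121, 11121],
                   'priceDeal': [100.0, 300.0, 50.0]})

# Named aggregation: one pass yields both the count and the price sum.
agg = df.groupby('Prd').agg(
    nrTargets_gr_250=('INDEX_250', 'count'),
    sm=('priceDeal', 'sum'))

# Combining periods: the true average must weight by counts.
true_avg = agg.sm.sum() / agg.nrTargets_gr_250.sum()  # 450 / 3 = 150.0
naive = (200.0 + 50.0) / 2  # 125.0 — averaging per-period averages is wrong
```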
Application of this function is just as before.
Upvotes: 1