chessosapiens

Reputation: 3419

how to apply time windows with different lengths to a pandas dataframe

I have the following dataframe, each row represent a sale transaction:

startDate           INDEX_250   priceDeal
2013-05-02 00:00:00 9312.000    255000.000
2013-09-17 00:00:00 11121.000   368209.000
2013-10-09 00:00:00 11121.000   254000.000
2013-11-14 00:00:00 11121.000   520000.000
2013-11-22 00:00:00 11121.000   201000.000
2014-02-05 00:00:00 11121.000   260000.000
2014-02-28 00:00:00 11121.000   425000.000
2014-03-01 00:00:00 11121.000   315000.000
2014-03-11 00:00:00 9312.000    427000.000
2014-04-27 00:00:00 9312.000    138070.000
2014-06-20 00:00:00 9312.000    270000.000
2014-07-21 00:00:00 9312.000    282000.000
2014-07-31 00:00:00 9312.000    308806.350
2014-09-27 00:00:00 11121.000   170000.000
2014-10-05 00:00:00 11121.000   171658.220
2014-10-11 00:00:00 11121.000   292000.000
2014-10-13 00:00:00 11121.000   125000.000
2014-10-30 00:00:00 9312.000    95000.000
2014-11-18 00:00:00 9312.000    158942.280
2015-01-25 22:00:00 11121.000   238829.370
2015-03-11 00:00:00 11121.000   180695.960
2015-03-14 00:00:00 9312.000    320932.860
2015-03-21 00:00:00 11121.000   139872.000
2015-09-04 00:00:00 11121.000   140000.000
2015-09-09 00:00:00 9312.000    235000.000

I want to see how many sales transactions happened in the previous 30 days (nrTargets_gr_250_30) and the previous 60 days (nrTargets_gr_250_60) for each INDEX_250 (grouped by it), starting from the first sale transaction on 2013-09-17. The example dataset contains only one INDEX_250, but there are other index numbers as well. The following output is desired:

      startDate  INDEX_250  nrTargets_gr_250_30 nrTargets_gr_250_60
      2013-10-17    11121   2.000               2.000
      2013-11-16    11121   1.000               3.000
      2013-12-16    11121   1.000               2.000
      2014-01-15    11121   0.000               1.000
      2014-02-14    11121   1.000               1.000
      2014-03-16    11121   2.000               3.000
      2014-04-15    11121   0.000               2.000
      2014-10-12    11121   3.000               3.000
      2014-11-11    11121   1.000               4.000
      2014-12-11    11121   0.000               1.000
      2015-02-09    11121   1.000               1.000
      2015-03-11    11121   0.000               1.000
      2015-04-10    11121   2.000               2.000
      2015-05-10    11121   0.000               2.000
      2015-09-07    11121   1.000               1.000
      2015-10-07    11121   0.000               1.000
      2016-02-04    11121   1.000               1.000
      2016-03-05    11121   0.000               1.000
      2017-01-29    11121   1.000               1.000

Upvotes: 2

Views: 119

Answers (1)

Valdi_Bo

Reputation: 31011

I noticed that you want startDate changed to the end of the respective 30-day period, counted from your start date.

Another detail is that you want the result grouped by INDEX_250 - how many previous transactions (in both respective periods) had the same value of INDEX_250.

Note also that rolling calculations can be performed on a window containing rows from a number of future periods, whereas you want the number of transactions from the previous 30 or 60 days, and rolling does not allow a negative number of periods.

This is why I took a different approach than "ordinary" rolling.

Start from auxiliary variables:

td30 = pd.Timedelta('30D')
dRng = pd.date_range(start='2013-09-17', end=df.startDate.max() + td30,
    freq='30D', closed='left')
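To see how this mapping works, here is a minimal sketch (with a shortened, hypothetical date range, not the full one built above) of how `dRng.asof` assigns a transaction date to the end of its 30-day bucket:

```python
import pandas as pd

td30 = pd.Timedelta('30D')
# Hypothetical bucket boundaries starting at the first sale date
dRng = pd.date_range(start='2013-09-17', periods=4, freq='30D')

# A transaction on 2013-10-09 falls in the bucket starting 2013-09-17,
# so asof returns that bucket start; adding 30 days gives the period end
x = pd.Timestamp('2013-10-09')
print(dRng.asof(x) + td30)  # 2013-10-17 00:00:00
```

This is exactly why the first result row in the desired output is dated 2013-10-17.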

Then define the following function computing both targets:

def targets(grp):
    grp['Prd'] = grp.startDate.apply(lambda x: dRng.asof(x) + td30)
    grp.set_index('Prd', inplace=True)
    trg30 = grp.groupby(level=0).INDEX_250.count()\
        .rename('nrTargets_gr_250_30').reindex(dRng, fill_value=0)
    trg60 = trg30.rolling(2).sum().rename('nrTargets_gr_250_60')\
        .fillna(0, downcast='infer')
    trg30 = trg30[trg30 > 0]
    trg60 = trg60[trg60 > 0]
    return trg30.to_frame().join(trg60, how='outer')\
        .fillna(0, downcast='infer').rename_axis('startDate')

Apply it and reset indices (just in this order, to have proper ordering of columns):

df2 = df[df.startDate >= '2013-09-17'].groupby('INDEX_250')\
    .apply(targets).reset_index(level=[0]).reset_index()

Notes:

  • I took only rows with startDate on or after the start date you specified (2013-09-17).
  • The type of both target columns is int. I think this is more natural, as these columns contain numbers of transactions, which are by nature integers.

And the last thing is to change the type of INDEX_250 to int:

df2.INDEX_250 = df2.INDEX_250.astype(int)

The result for this INDEX_250 group is just as you specified, except for the result rows from 2016 and 2017, whose source transactions are not present in your sample data.

Extended version - with average prices

To extend the result with average prices for each "final" date and both targets, two changes are required.

First, define another function to "reformat" target DataFrame:

def trgReformat(trg):
    trg = trg[trg.nrTargets_gr_250 > 0].copy()
    trg['avgPrice'] = trg.sm / trg.nrTargets_gr_250
    return trg.drop(columns='sm')

Second, redefine the targets function as:

def targets(grp):
    grp['Prd'] = grp.startDate.apply(lambda x: dRng.asof(x) + td30)
    grp.set_index('Prd', inplace=True)
    trg30 = grp.groupby(level=0).agg(
        nrTargets_gr_250=('INDEX_250', 'count'), sm=('priceDeal', 'sum'))\
        .reindex(dRng, fill_value=0)
    trg60 = trg30.rolling(2).sum().fillna(0, downcast='infer')
    trg30 = trgReformat(trg30)
    trg60 = trgReformat(trg60)
    return trg30.join(trg60, how='outer', lsuffix='_30', rsuffix='_60')\
        .fillna(0, downcast='infer').rename_axis('startDate')

This function uses named aggregation to compute:

  • nrTargets_gr_250 - the number of rows,
  • sm - sum of prices.
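As a quick illustration of named aggregation (with made-up data, not the question's actual DataFrame): each output column is declared as a pair of (input column, aggregation function).

```python
import pandas as pd

df = pd.DataFrame({'INDEX_250': [11121, 11121, 9312],
                   'priceDeal': [100.0, 200.0, 50.0]})

# Named aggregation: output column name = (input column, aggregation)
out = df.groupby('INDEX_250').agg(
    nrTargets_gr_250=('INDEX_250', 'count'),
    sm=('priceDeal', 'sum'))
print(out)
```

Here group 11121 gets nrTargets_gr_250 = 2 and sm = 300.0, while group 9312 gets 1 and 50.0.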

The reason is that trg60 is computed using rolling (over 2 consecutive 30-day periods), so rolling an average alone would not be sufficient here.

Computing the average price can thus be deferred until the reformatting of each target.
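The point can be shown with a tiny numeric sketch (made-up counts and price sums): averaging the per-period averages generally differs from dividing the summed prices by the summed counts, which is why sums and counts are rolled and the average is computed last.

```python
# Period 1: 1 sale totalling 100; Period 2: 3 sales totalling 600
counts = [1, 3]
sums = [100.0, 600.0]

# Average of per-period averages - wrong for the combined 60-day window
avg_of_avgs = (sums[0] / counts[0] + sums[1] / counts[1]) / 2  # 150.0

# Sum prices and counts first, divide once - the correct combined average
true_avg = (sums[0] + sums[1]) / (counts[0] + counts[1])  # 175.0

print(avg_of_avgs, true_avg)
```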

Application of this function is just as before.

Upvotes: 1
