ashleych
ashleych

Reputation: 1054

How to find max of a column within a particular daterange using data.table

I have a dataset with contractID, data and DaysPastDue information. How do I look forward say 12 months for each row and identify the Max DPD, corresponding to that contractID. The data set looks like this

 Contract_number       Date    DPD
 1:               a 2014-03-01  14
 2:               a 2014-03-01   5
 3:               a 2014-10-01   6
 4:               a 2014-10-01  16
 5:               a 2015-12-01   4
 6:               a 2015-12-01  17
 7:               a 2016-09-01  16
 8:               a 2016-09-01  15
 9:               a 2016-10-01   3
10:               a 2016-10-01   8
11:               b 2014-05-01  18
12:               b 2014-05-01   9
13:               b 2014-08-01   2
14:               b 2014-08-01  14

Code for generating this dataset

library(data.table)
set.seed(123)
dummy_data= data.table(Contract_number = letters[1:4], 
                       Date = sample(seq(as.Date('2014/01/01'), as.Date('2016/12/01'), by="month"), 20),
                       DPD=sample.int(20:50, 40, replace = TRUE)
)
dummy_data[order(Contract_number,Date)]

I have a dplyr solution to this, wondering if there is a more concise datatable way to do this?

max_dpd_data<-dummy_data %>% left_join(dummy_data,dummy_data,by="Contract_number") %>% 
  filter(Date.y>Date.x & Date.y<=(Date.x + months(12))) %>%
  group_by(Contract_number, Date.x) %>% summarise(Max_DPD_12_M = max(DPD.y), N_Mnths_Future=n()) %>% 
  rename(Date='Date.x')
dummy_data1<- left_join(dummy_data,max_dpd_data,by = c("Contract_number","Date"))

I also do not want to go the route of using expand.grid to fill in missing months, and then using Shift.

Upvotes: 1

Views: 54

Answers (1)

Benjamin Christoffersen
Benjamin Christoffersen

Reputation: 4841

I figure you are looking for something like this

> library(data.table)
> set.seed(123)
> dummy_data= data.table(Contract_number = letters[1:4], 
+                        Date = sample(seq(as.Date('2014/01/01'), as.Date('2016/12/01'), by="month"), 20),
+                        DPD=sample.int(20:50, 40, replace = TRUE)
+ )
> 
> # You can useset key to sort
> setkey(dummy_data, Contract_number, Date)
> dummy_data
    Contract_number       Date DPD
 1:               a 2014-05-01  16
 2:               a 2014-05-01   3
 3:               a 2014-11-01  18
 4:               a 2014-11-01   3
 5:               a 2015-05-01  14
 6:               a 2015-05-01  16
 7:               a 2016-07-01  14
 8:               a 2016-07-01   4
 9:               a 2016-09-01   6
10:               a 2016-09-01   6
11:               b 2014-01-01   5
12:               b 2014-01-01  16
13:               b 2014-02-01  15
14:               b 2014-02-01   3
15:               b 2015-01-01   3
16:               b 2015-01-01  18
17:               b 2016-04-01  14
18:               b 2016-04-01   9
19:               b 2016-10-01  16
20:               b 2016-10-01   3
21:               c 2014-03-01   1
22:               c 2014-03-01  12
23:               c 2014-06-01   7
24:               c 2014-06-01  18
25:               c 2015-02-01  13
26:               c 2015-02-01   9
27:               c 2015-04-01  11
28:               c 2015-04-01   5
29:               c 2016-01-01  20
30:               c 2016-01-01   1
31:               d 2014-12-01  19
32:               d 2014-12-01   9
33:               d 2015-07-01  10
34:               d 2015-07-01   5
35:               d 2015-12-01   5
36:               d 2015-12-01   8
37:               d 2016-02-01  12
38:               d 2016-02-01  10
39:               d 2016-06-01  20
40:               d 2016-06-01   8
    Contract_number       Date DPD
> 
> # Add yearmonth decimal column
> dummy_data[, ym := as.integer(format(Date, "%Y%m"))][
+   , ym := (ym %/% 100) + (ym %% 100) / 12][, ym_less_one := ym - 1][
+     , ym2 := ym]
> 
> dummy_data <- dummy_data[
+   dummy_data, on = c("Contract_number", "ym>ym", "ym_less_one<=ym2"),
+   .(Date = first(i.Date), DPD = first(i.DPD), max_DPD = max(DPD)), 
+   by =.EACHI][, c("ym", "ym_less_one") := NULL]
> 
> print(dummy_data)
    Contract_number       Date DPD max_DPD
 1:               a 2014-05-01  16      18
 2:               a 2014-05-01   3      18
 3:               a 2014-11-01  18      16
 4:               a 2014-11-01   3      16
 5:               a 2015-05-01  14      NA
 6:               a 2015-05-01  16      NA
 7:               a 2016-07-01  14       6
 8:               a 2016-07-01   4       6
 9:               a 2016-09-01   6      NA
10:               a 2016-09-01   6      NA
11:               b 2014-01-01   5      18
12:               b 2014-01-01  16      18
13:               b 2014-02-01  15      18
14:               b 2014-02-01   3      18
15:               b 2015-01-01   3      NA
16:               b 2015-01-01  18      NA
17:               b 2016-04-01  14      16
18:               b 2016-04-01   9      16
19:               b 2016-10-01  16      NA
20:               b 2016-10-01   3      NA
21:               c 2014-03-01   1      18
22:               c 2014-03-01  12      18
23:               c 2014-06-01   7      13
24:               c 2014-06-01  18      13
25:               c 2015-02-01  13      20
26:               c 2015-02-01   9      20
27:               c 2015-04-01  11      20
28:               c 2015-04-01   5      20
29:               c 2016-01-01  20      NA
30:               c 2016-01-01   1      NA
31:               d 2014-12-01  19      10
32:               d 2014-12-01   9      10
33:               d 2015-07-01  10      20
34:               d 2015-07-01   5      20
35:               d 2015-12-01   5      20
36:               d 2015-12-01   8      20
37:               d 2016-02-01  12      20
38:               d 2016-02-01  10      20
39:               d 2016-06-01  20      NA
40:               d 2016-06-01   8      NA
    Contract_number       Date DPD max_DPD

I am not sure whether or not you want the month of the observation within the 12 month period. Further, there might be some issues with the >= operations + floating point issues. A solution is maybe to subtract some_factor * .Machine$double.eps from the ym_less_one column.

Upvotes: 1

Related Questions