Reputation: 2623
I have a data.table, dt, that spans multiple years. The data is grouped so each group has different length of years. I only want to keep the first three years of each group. How can I do this with data.table? Here is some data to test:
dates <- c(seq(as.Date('2010-01-03'),as.Date('2019-12-31'),by = 1),
seq(as.Date('2013-01-02'),as.Date('2018-12-31'),by = 1),
seq(as.Date('2015-01-02'),as.Date('2020-07-31'),by = 1))
set.seed(1995)
value <- rnorm(length(dates), mean = 100, sd = 50)
IDs <- c(rep(c("ACG"),length.out = length(seq(as.Date('2010-01-03'),as.Date('2019-12-31'),by = 1))),
rep(c("MKD"),length.out = length(seq(as.Date('2013-01-02'),as.Date('2018-12-31'),by = 1))),
rep(c("ZED"),length.out = length(seq(as.Date('2015-01-02'),as.Date('2020-07-31'),by = 1)))
)
dt <- data.table(Date = dates,
Value = value,
ID = IDs
)
dt
Date Value ID
1: 2010-01-03 153.03816 ACG
2: 2010-01-04 83.22491 ACG
3: 2010-01-05 107.26521 ACG
4: 2010-01-06 119.70395 ACG
5: 2010-01-07 183.24604 ACG
---
7874: 2020-07-27 184.45801 ZED
7875: 2020-07-28 91.53373 ZED
7876: 2020-07-29 67.42443 ZED
7877: 2020-07-30 125.62496 ZED
7878: 2020-07-31 89.02373 ZED
The final data.table should only have the first three years of data like this:
finalDT <- dt[c(1:1094,3651:4744,5841:6935),]
finalDT
Date Value ID
1: 2010-01-03 153.03816 ACG
2: 2010-01-04 83.22491 ACG
3: 2010-01-05 107.26521 ACG
4: 2010-01-06 119.70395 ACG
5: 2010-01-07 183.24604 ACG
---
3279: 2017-12-27 102.10622 ZED
3280: 2017-12-28 94.97718 ZED
3281: 2017-12-29 131.47358 ZED
3282: 2017-12-30 112.83836 ZED
3283: 2017-12-31 184.54966 ZED
The approach I've use works fine on a smallish dataset, but I have over a hundred IDs, some with 20 years of data. I need a programmatic approach in data.table
Upvotes: 1
Views: 82
Reputation: 33498
dt[, .SD[year(Date) %in% unique(year(Date))[1:3]], by = ID]
Or
dt[, .SD[year(Date) <= unique(year(Date))[3]], by = ID]
Make sure to order by date first.
Upvotes: 3