Reputation: 29
I have a df with start dates and end dates of some process. Replication data:
ID1 <- c("AUT","AUT","AUT","BEL","BEL","BEL")
start_date <- c("2008-12-02", "2013-12-16", "2016-05-17", "2007-06-10", "2007-12-21", "2008-03-20")
start_date <- as.Date(start_date, "%Y-%m-%d")
end_date <- c("2013-12-15", "2016-05-16", "2017-11-30", "2007-12-20", "2008-03-19", "2008-12-29")
end_date <- as.Date(end_date, "%Y-%m-%d")
ID2 <- 1:6
df <- data.frame(ID1, as.character(start_date), as.character(end_date), ID2)
Which looks like:
ID1 start_date end_date ID2
1 AUT 2008-12-02 2013-12-15 1
2 AUT 2013-12-16 2016-05-16 2
3 AUT 2016-05-17 2017-11-30 3
4 BEL 2007-06-10 2007-12-20 4
5 BEL 2007-12-21 2008-03-19 5
6 BEL 2008-03-20 2008-12-29 6
I want to add a new column and rows for every year in the given periods to this df: the column should be year
, the value of which depends on whether the first day of the year (20XX-01-01) is inside or outside the process bounds. What i'd like to see is:
ID1 start_date end_date ID2 year
1 AUT 2008-12-02 2013-12-15 1 2009
2 AUT 2008-12-02 2013-12-15 1 2010
2 AUT 2008-12-02 2013-12-15 1 2011
4 AUT 2008-12-02 2013-12-15 1 2012
5 AUT 2008-12-02 2013-12-15 1 2013
6 AUT 2013-12-16 2016-05-16 2 2014
7 AUT 2013-12-16 2016-05-16 2 2015
8 AUT 2013-12-16 2016-05-16 2 2016
9 AUT 2016-05-17 2017-11-30 3 2017
10 BEL 2007-06-10 2007-12-20 4 NA
11 BEL 2007-12-21 2008-03-19 5 2008
12 BEL 2008-03-20 2008-12-29 6 NA
Edit: minor suggested changes to enhance clarity
Upvotes: 1
Views: 161
Reputation: 1795
So I modified the initial code slightly :
library(lubridate)
library(data.table)
ID1 <- c("AUT","AUT","AUT","BEL","BEL","BEL")
start_date <- ymd("2008-12-02", "2013-12-16", "2016-05-17", "2007-06-10", "2007-12-21", "2008-03-20")
end_date <- ymd("2013-12-15", "2016-05-16", "2017-11-30", "2007-12-20", "2008-03-19", "2008-12-29")
ID2 <- 1:6
df <- data.table(ID1, start_date, end_date, ID2)
df[,yearDiff:=year(end_date)-year(start_date)]
df<-df[,cbind(.SD,year=(year(start_date)+1):year(end_date)),by="ID2"]
df[,dateInterval:=interval(df$start_date,df$end_date)]
df[,IsYearWithinDate:=(ymd(paste0(year,"01","01",sep="-"))%within% dateInterval)]
df[,.(ID1,start_date,end_date,ID2,year,IsYearWithinDate)]
resulting in (via df[,.(ID1,start_date,end_date,ID2,year
):
ID1 start_date end_date ID2 year
1: AUT 2008-12-02 2013-12-15 1 2009
2: AUT 2008-12-02 2013-12-15 1 2010
3: AUT 2008-12-02 2013-12-15 1 2011
4: AUT 2008-12-02 2013-12-15 1 2012
5: AUT 2008-12-02 2013-12-15 1 2013
6: AUT 2013-12-16 2016-05-16 2 2014
7: AUT 2013-12-16 2016-05-16 2 2015
8: AUT 2013-12-16 2016-05-16 2 2016
9: AUT 2016-05-17 2017-11-30 3 2017
10: BEL 2007-06-10 2007-12-20 4 2008
11: BEL 2007-06-10 2007-12-20 4 2007
12: BEL 2007-12-21 2008-03-19 5 2008
13: BEL 2008-03-20 2008-12-29 6 2009
14: BEL 2008-03-20 2008-12-29 6 2008
Upvotes: 1