Reputation: 1193
I have a data table with start date and end date that I want to reshape so that it repeats all the information for each single date in the interval between the start and end date.
My data is as follows
tripstart tripend Country
1: 2014-10-07 2014-10-10 US
2: 2013-06-12 2013-06-13 FR
3: 2013-02-07 2013-02-10 DK
Based on this data the result I am looking for would look similar to
Day Country
2014-10-10 US
2014-10-09 US
2014-10-08 US
2014-10-07 US
2013-06-13 FR
2013-06-12 FR
2013-02-10 DK
2013-02-09 DK
2013-02-08 DK
2013-02-07 DK
I tried the following without success,
setkey(hotel_stays, tripstart, tripend)
# Get the first date that was used as transaction date.
max_date <- max(hotel_stays$tripend, hotel_stays$tripstart)
min_date <- min(hotel_stays$tripend, hotel_stays$tripstart)
hotel_stays_long <- data.table(day = seq.Date(min_date, to = max_date,, length.out = max_date - min_date))
setkey(hotel_stays_long, day)
foverlaps(hotel_stays, hotel_stays_long)
R code for the data:
hotel_stays <- data.table(tripstart = c(as.Date("2014-10-07"), as.Date("2013-06-12"), as.Date("2013-02-07")), tripend = c(as.Date("2014-10-10"), as.Date("2013-06-13"), as.Date("2013-02-10")), Country = c("US", "FR", "DK"))
Upvotes: 1
Views: 123
Reputation: 1193
Thanks to Frank I have two solutions.
hotel_stays <- data.table(tripstart = c(as.Date("2014-10-07"), as.Date("2013-06-12"), as.Date("2013-02-07")), tripend = c(as.Date("2014-10-10"), as.Date("2013-06-13"), as.Date("2013-02-10")), Country = c("US", "FR", "DK"))
### Solution 1
setkey(hotel_stays, tripstart, tripend)
# Get the first date that was used as transaction date.
max_date <- max(hotel_stays$tripend, hotel_stays$tripstart)
min_date <- min(hotel_stays$tripend, hotel_stays$tripstart)
hotel_stays_long <- data.table(day = seq.Date(min_date, to = max_date,, length.out = max_date - min_date))
hotel_stays_long[, end := day]
setkey(hotel_stays_long, day, end)
hotel_stays_long <- foverlaps(hotel_stays, hotel_stays_long)
hotel_stays_long[, c("end", "tripstart", "tripend") := NULL]
## Solution 2
hotel_stays_long[, .(day = seq(tripstart, tripend, by = "day"), Country),
by = 1 : nrow(hotel_stays_long)]
I ran both examples on a private data set that contains some additional columns. The info on this data set is,
> dim(hotel_stays)
[1] 4675 28
The first solution leads to
replications elapsed relative user.self sys.self user.child sys.child
1 100 1.898 1 1.889 0.005 0 0
The second solution leads to
replications elapsed relative user.self sys.self user.child sys.child
1 100 45.244 1 45.253 0 0 0
The test environment is
> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.6 (Santiago)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rbenchmark_1.0.0 data.table_1.9.5 RODBC_1.3-11
loaded via a namespace (and not attached):
[1] tools_3.2.0 chron_2.3-45
Conclusion, the first solution is much faster but less elegant.
Upvotes: 1