Stereo
Stereo

Reputation: 1193

spread data over an interval

I have a data table with start date and end date that I want to reshape so that it repeats all the information for each single date in the interval between the start and end date.

My data is as follows

       tripstart    tripend Country
1: 2014-10-07 2014-10-10      US
2: 2013-06-12 2013-06-13      FR
3: 2013-02-07 2013-02-10      DK

Based on this data the result I am looking for would look similar to

Day          Country
2014-10-10   US
2014-10-09   US
2014-10-08   US
2014-10-07   US
2013-06-13   FR
2013-06-12   FR
2013-02-10   DK
2013-02-09   DK
2013-02-08   DK
2013-02-07   DK

I tried the following without success,

setkey(hotel_stays, tripstart, tripend)

# Get the first date that was used as transaction date.
max_date <- max(hotel_stays$tripend, hotel_stays$tripstart)
min_date <- min(hotel_stays$tripend, hotel_stays$tripstart)

hotel_stays_long <- data.table(day = seq.Date(min_date, to = max_date,, length.out = max_date - min_date))
setkey(hotel_stays_long, day)
foverlaps(hotel_stays, hotel_stays_long)

R code for the data:

hotel_stays <- data.table(tripstart = c(as.Date("2014-10-07"), as.Date("2013-06-12"), as.Date("2013-02-07")), tripend = c(as.Date("2014-10-10"), as.Date("2013-06-13"), as.Date("2013-02-10")), Country = c("US", "FR", "DK")) 

Upvotes: 1

Views: 123

Answers (1)

Stereo
Stereo

Reputation: 1193

Thanks to Frank I have two solutions.

hotel_stays <- data.table(tripstart = c(as.Date("2014-10-07"), as.Date("2013-06-12"), as.Date("2013-02-07")), tripend = c(as.Date("2014-10-10"), as.Date("2013-06-13"), as.Date("2013-02-10")), Country = c("US", "FR", "DK"))

### Solution 1
setkey(hotel_stays, tripstart, tripend)

# Get the first date that was used as transaction date.
max_date <- max(hotel_stays$tripend, hotel_stays$tripstart)
min_date <- min(hotel_stays$tripend, hotel_stays$tripstart)

hotel_stays_long <- data.table(day = seq.Date(min_date, to = max_date,, length.out = max_date - min_date))
hotel_stays_long[, end := day]
setkey(hotel_stays_long, day, end)
hotel_stays_long <- foverlaps(hotel_stays, hotel_stays_long)
hotel_stays_long[, c("end", "tripstart", "tripend") := NULL]

## Solution 2
hotel_stays_long[, .(day = seq(tripstart, tripend, by = "day"), Country), 
                       by = 1 : nrow(hotel_stays_long)] 

I ran both examples on a private data set that contains some additional columns. The info on this data set is,

> dim(hotel_stays)
[1] 4675   28

The first solution leads to

replications elapsed relative user.self sys.self user.child sys.child
1          100   1.898        1     1.889    0.005          0         0

The second solution leads to

 replications elapsed relative user.self sys.self user.child sys.child
1          100  45.244        1    45.253        0          0         0

The test environment is

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.6 (Santiago)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rbenchmark_1.0.0 data.table_1.9.5 RODBC_1.3-11    

loaded via a namespace (and not attached):
[1] tools_3.2.0  chron_2.3-45

Conclusion, the first solution is much faster but less elegant.

Upvotes: 1

Related Questions