plausibly_exogenous
plausibly_exogenous

Reputation: 512

Breaking up observations across multiple dates in R

I am hoping to find a data.table + lubridate solution to the following problem.

Suppose that I have the following dataset:


library(data.table)
library(lubridate)
library(magrittr)

sample <- data.table(start = c("2018-12-22 23:00:00",
                               "2018-12-23 06:00:00",
                               "2018-12-22 06:00:00",
                               "2018-12-23 06:00:00"),
                     end = c("2018-12-23 06:00:00",
                             "2018-12-23 13:00:00",
                             "2018-12-23 12:00:00",
                             "2018-12-24 01:00:00"),
                     store = c("A", "A", "B", "B"),
                     var = 1:4)

sample[, start:= ymd_hms(start)]
sample[, end := ymd_hms(end)]

The output looks like:


> sample
                 start                 end store var
1: 2018-12-22 23:00:00 2018-12-23 06:00:00     A   1
2: 2018-12-23 06:00:00 2018-12-23 13:00:00     A   2
3: 2018-12-22 06:00:00 2018-12-23 12:00:00     B   3
4: 2018-12-23 06:00:00 2018-12-24 01:00:00     B   4

Notice that on rows 1, 3, and 4, we have observations that span multiple dates. I would like to break up these observations so that they only take up one calendar date each. It is also possible that a start date and end date spans multiple days, but I would like to have one observation for each date. For the sample above, the data.table should look like:

                 start                 end store var
1: 2018-12-22 23:00:00 2018-12-22 23:59:59     A   1
2: 2018-12-23 00:00:00 2018-12-23 06:00:00     A   1
3: 2018-12-23 06:00:00 2018-12-23 13:00:00     A   2
4: 2018-12-22 06:00:00 2018-12-22 23:59:59     B   3
5: 2018-12-23 00:00:00 2018-12-23 12:00:00     B   3
6: 2018-12-23 06:00:00 2018-12-23 23:59:59     B   4
7: 2018-12-24 00:00:00 2018-12-24 01:00:00     B   4

It is important the var variable is the same when we break up the observation across multiple dates.

Thanks!

Upvotes: 2

Views: 97

Answers (2)

r2evans
r2evans

Reputation: 160407

Using a simple helper-function,

library(lubridate)
func <- function(st, en) { 
  days <- seq(floor_date(min(st), unit = "days"), 
              ceiling_date(max(en), unit = "days"),
              by = "1 day")
  days <- c(st, days[-c(1, length(days))], en)
  list(days[-length(days)], days[-1])
}

We get:

library(data.table)
sample[, setNames(func(start, end), c("start", "end")), by = .(store, var)]
#     store   var               start                 end
#    <char> <int>              <POSc>              <POSc>
# 1:      A     1 2018-12-22 23:00:00 2018-12-23 00:00:00
# 2:      A     1 2018-12-23 00:00:00 2018-12-23 06:00:00
# 3:      A     2 2018-12-23 06:00:00 2018-12-23 13:00:00
# 4:      B     3 2018-12-22 06:00:00 2018-12-23 00:00:00
# 5:      B     3 2018-12-23 00:00:00 2018-12-23 12:00:00
# 6:      B     4 2018-12-23 06:00:00 2018-12-24 00:00:00
# 7:      B     4 2018-12-24 00:00:00 2018-12-24 01:00:00

Upvotes: 3

langtang
langtang

Reputation: 24722

# expand the rows
sample = sample[sample[, .(date = seq(as.IDate(start),as.IDate(end),1)), by=var], on="var"]

# fix the times
sample[, `:=`(
  start = fifelse(as.IDate(start) == date,
                  ymd_hms(paste0(as.Date(start),as.ITime(start))),
                  ymd_hms(paste0(date,"00:00:00"))),
  end = fifelse(as.IDate(end) == date,
                ymd_hms(paste0(as.Date(end),as.ITime(end))),
                ymd_hms(paste0(date, "23:59:59"))),
  date = NULL
)]

Output:

                 start                 end store var
1: 2018-12-22 23:00:00 2018-12-22 23:59:59     A   1
2: 2018-12-23 00:00:00 2018-12-23 06:00:00     A   1
3: 2018-12-23 06:00:00 2018-12-23 13:00:00     A   2
4: 2018-12-22 06:00:00 2018-12-22 23:59:59     B   3
5: 2018-12-23 00:00:00 2018-12-23 12:00:00     B   3
6: 2018-12-23 06:00:00 2018-12-23 23:59:59     B   4
7: 2018-12-24 00:00:00 2018-12-24 01:00:00     B   4

Upvotes: 1

Related Questions