lil_barnacle
lil_barnacle

Reputation: 168

Split overlapping date time intervals into non-overlapping intervals, within values of an identifier

I have an exact problem with this question Split overlapping intervals into non-overlapping intervals, within values of an identifier using R data.table, but with an additional issue that my intervals are date time instead of integers. My identifier is the id column. In the original data, I have overlapping time periods, and I want to split them into all non-overlapping periods for each id.

Have:

dt <- data.table(id = c(1, 1, 1),
                 start = c("1970-01-10 06:01:16", "1970-01-10 12:01:16", "1970-01-10 09:34:49"),
                 end = c("1970-01-10 12:01:16", "1970-01-10 17:01:16", "1970-01-11 07:49:48"),
                 value = c(1, 0, 3))

id  start                end                    value
1   1970-01-10 06:01:16  1970-01-10 12:01:16    1
1   1970-01-10 12:01:16  1970-01-10 17:01:16    0
1   1970-01-10 09:34:49  1970-01-11 07:49:48    3

Want:

id  start                end                    value
1   1970-01-10 06:01:16  1970-01-10 09:34:49    1
1   1970-01-10 09:34:49  1970-01-10 12:01:16    1
1   1970-01-10 12:01:16  1970-01-10 17:01:16    0
1   1970-01-10 09:34:49  1970-01-10 12:01:16    3
1   1970-01-10 12:01:16  1970-01-10 17:01:16    3
1   1970-01-10 17:01:16  1970-01-11 07:49:48    3

Appreciate any help!

Upvotes: 1

Views: 1180

Answers (3)

Uwe
Uwe

Reputation: 42544

If I understand correctly, the OP wants to split the time interval of each row along the vector of all time instances in the dataset.

I was wondering if it would be possible to find a less convoluted and verbose approach. So, here are three variants:

  1. Using join
  2. Using non-equi join within a split
  3. Using foverlaps() - this is my preferred option

Variant 1

library(data.table)
cols <- c("start", "end")
dt[, (cols) := lapply(.SD, as.POSIXct), .SDcols = cols]
dt[, rn := .I]
ts <- dt[, CJ(rn, start = c(start, end), unique = TRUE)]
dt[ts, on = .(rn, start)][
  , setnafill(.SD, "locf"), by = .(rn)][
    start < end][
      , end := shift(start, type = "lead", fill = last(end)), by = .(rn)][]
   rn id               start                 end value
1:  1  1 1970-01-10 06:01:16 1970-01-10 09:34:49     1
2:  1  1 1970-01-10 09:34:49 1970-01-10 12:01:16     1
3:  2  1 1970-01-10 12:01:16 1970-01-10 17:01:16     0
4:  3  1 1970-01-10 09:34:49 1970-01-10 12:01:16     3
5:  3  1 1970-01-10 12:01:16 1970-01-10 17:01:16     3
6:  3  1 1970-01-10 17:01:16 1970-01-11 07:49:48     3

Explanation of variant 1

  • lines 2 & 3: coerce character dates to numeric dates
  • line 4: append row numbers
  • line 5: create vector of common time instances, cross join CJ() with all row numbers
  • line 6: right join ts to dt, i.e., take all common time instances
  • line 7: replace NAs by last observation carried forward
  • line 8: keep only rows which fall within the originally given intervals
  • line 9: copy the end points of the new sub-intervals to the end column

Variant 2

library(data.table)
cols <- c("start", "end")
dt[, (cols) := lapply(.SD, as.POSIXct), .SDcols = cols]
dt[, rn := .I]
ts <- dt[, .(unique(sort(c(start, end))))]
dt[, .SD[ts, on = .(start <= V1, end >= V1), nomatch = NULL, .SD][
  , end := shift(start, type = "lead")][-.N], by = rn]
   rn id               start                 end value
1:  1  1 1970-01-10 06:01:16 1970-01-10 09:34:49     1
2:  1  1 1970-01-10 09:34:49 1970-01-10 12:01:16     1
3:  2  1 1970-01-10 12:01:16 1970-01-10 17:01:16     0
4:  3  1 1970-01-10 09:34:49 1970-01-10 12:01:16     3
5:  3  1 1970-01-10 12:01:16 1970-01-10 17:01:16     3
6:  3  1 1970-01-10 17:01:16 1970-01-11 07:49:48     3

Explanation of variant 2

  • lines 2 to 4: as above
  • line 5: create a data.table with the vector of common time instances V1
  • line 6: For each row rn,
    • perform a non-equi right join of the subset .SD with the common time instances,
      thereby keeping only rows which fall within the originally given intervals
    • line 7: copy the end points of the new sub-intervals to the end column
      and drop the last line of each subset

Variant 3: foverlaps()

library(data.table)
cols <- c("start", "end")
dt[, (cols) := lapply(.SD, as.POSIXct), .SDcols = cols]
ts <- dt[, unique(sort(c(start, end)))]
foverlaps(dt, data.table(start = head(ts, -1L), end = tail(ts, -1L), key = "start,end"))[
  start >= i.start & end <= i.end, -c("i.start", "i.end")]
                 start                 end id value
1: 1970-01-10 06:01:16 1970-01-10 09:34:49  1     1
2: 1970-01-10 09:34:49 1970-01-10 12:01:16  1     1
3: 1970-01-10 12:01:16 1970-01-10 17:01:16  1     0
4: 1970-01-10 09:34:49 1970-01-10 12:01:16  1     3
5: 1970-01-10 12:01:16 1970-01-10 17:01:16  1     3
6: 1970-01-10 17:01:16 1970-01-11 07:49:48  1     3

Explanation of variant 3

  • lines 2 to 4: as above
  • line 5: create vector of common time instances
  • line 6: call foverlaps(). The second parameter is a keyed data.table which consists of sub-intervals [start, end] created from the vector of common time instances ts
  • line 7: keep only rows which fall within the originally given intervals and drop columns which are no longer needed

The second parameter contains all the sub-intervals

data.table(start = head(ts, -1L), end = tail(ts, -1L), key = "start,end")
                 start                 end
1: 1970-01-10 06:01:16 1970-01-10 09:34:49
2: 1970-01-10 09:34:49 1970-01-10 12:01:16
3: 1970-01-10 12:01:16 1970-01-10 17:01:16
4: 1970-01-10 17:01:16 1970-01-11 07:49:48

Upvotes: 1

Peace Wang
Peace Wang

Reputation: 2419

updated answer

This is a data.table version answer avoid using for loop.

dt <- data.table(id = c(1, 1, 1),
                 start = as.POSIXct(c("1970-01-10 06:01:16", "1970-01-10 12:01:16", "1970-01-10 09:34:49")),
                 end = as.POSIXct(c("1970-01-10 12:01:16", "1970-01-10 17:01:16", "1970-01-11 07:49:48")),
                 value = c(1, 0, 3))

time_seq <- dt[,.(start=unique(sort(c(start,end))))]

dt[, `:=`(
  start_pos = sapply(start, function(x) which(x == time_seq$start)),
  end_pos = sapply(end, function(x) which(x == time_seq$start))
)][,num := end_pos - start_pos]

# same size with expected result
dt_desired <- dt[rep(seq(.N), num)][,order:=rowid(num)]


dt_final <- dt_desired[order == 1,
                       .(id,
                         start = time_seq[start_pos:(end_pos - 1), start],
                         end = time_seq[(start_pos + 1):end_pos, start],
                         value),
                       by = num]

original answer

Here is my very intuitive solution. The key is to create the unique ordered time series time_seq, which is the join of all start and end. Then to loop all the row to generate start and end from time_seq.

library(data.table)
dt <- data.table(id = c(1, 1, 1),
                 start = as.POSIXct(c("1970-01-10 06:01:16", "1970-01-10 12:01:16", "1970-01-10 09:34:49")),
                 end = as.POSIXct(c("1970-01-10 12:01:16", "1970-01-10 17:01:16", "1970-01-11 07:49:48")),
                 value = c(1, 0, 3))

# initialization
dt_tmp <- data.table()
tem <- data.table()

# unique full time series with order
time_seq <- data.table(start = unique(sort(c(dt$start,dt$end)) ))

for (i in 1:nrow(dt)){
  # select date between start and end of the i-th row 
  tem <- time_seq[dt[i,start] <= start & start <= dt[i,end]]
  len <- length(tem$start)
  if(len > 2)
    tmp <- data.table(id = dt[i,id],
                    start = tem[1:len-1,start],
                    end = tem[2:len,start],
                    value = dt[i,value])
  else
    tmp <- dt[i,]
  dt_tmp <- rbind(dt_tmp,tmp)
}
dt_tmp

Upvotes: 2

Zachary Cross
Zachary Cross

Reputation: 2318

Ok, here we go.

We're going to solve your problem in five steps:

  1. Create 'interval' column from your original 'start' and 'end'
  2. Create vector of all start + end dates and sort it
  3. Using our 'interval' column, store an ordered list of all dates within that interval, per row
  4. Use lubridate:int_diff() to turn our ordered list of dates into a list of intervals
  5. Unnest our list-of-interval column into multiple rows: we're done!

Below, you'll find a short version that does just the above five steps, and then a longer version the plots the results with ggplot.


Short version

library(tidyverse)
library(ggplot2)

# Short version

# I'm using a tibble instead of data.table; hope you don't mind
dt <- tibble(
  id = c(1, 1, 1),
  start = c("1970-01-10 06:01:16", "1970-01-10 12:01:16", "1970-01-10 09:34:49"),
  end = c("1970-01-10 12:01:16", "1970-01-10 17:01:16", "1970-01-11 07:49:48"),
  value = c(1, 0, 3)
)

# 1. Create 'interval' column from your original 'start' and 'end'
# 2. Create vector of all start + end dates and sort it
# 3. Using our 'interval' column, store an ordered list of all dates within that interval, per row
# 4. Use lubridate:int_diff() to turn our ordered list of dates into a list of intervals
# 5. Unnest our list-of-interval column into multiple rows: we're done!


# 1. Create 'interval' column
dt <- dt %>%
  mutate(
    start = lubridate::ymd_hms(start),  # convert to Date objects
    end = lubridate::ymd_hms(end),  # convert to Date objects
    date_range = lubridate::interval(start, end)
  )

# 2. Create vector of all start + end dates and sort it
all_dates <- dt %>% 
  select(start, end, value) %>%
  pivot_longer(!value, names_to = "date_type", values_to="date") %>%
  arrange(date) %>%
  select(date) %>%
  distinct()

# Steps 3 and 4 in one mutate
dt <- dt %>% 
  rowwise() %>%
  mutate(
    # 3. Store an ordered list of all dates within that interval
    bounded_dates = list(filter(all_dates, all_dates$date %within% date_range) %>% pull(date)),
    # 4. convert list-of-dates column into list-of-intervals
    bounded_intervals = list(int_diff(bounded_dates))
  )

# 5. Unnest our list-of-interval column into multiple rows: we're done!
new_intervals <- dt %>%
  select(value, bounded_intervals) %>%
  unnest(bounded_intervals)

new_intervals

This should create:

# A tibble: 6 x 7
  value bounded_intervals                                start              
  <fct> <Interval>                                       <dttm>             
1 1     1970-01-10 06:01:16 UTC--1970-01-10 09:34:49 UTC 1970-01-10 06:01:16
2 1     1970-01-10 09:34:49 UTC--1970-01-10 12:01:16 UTC 1970-01-10 09:34:49
3 0     1970-01-10 12:01:16 UTC--1970-01-10 17:01:16 UTC 1970-01-10 12:01:16
4 3     1970-01-10 09:34:49 UTC--1970-01-10 12:01:16 UTC 1970-01-10 09:34:49
5 3     1970-01-10 12:01:16 UTC--1970-01-10 17:01:16 UTC 1970-01-10 12:01:16
6 3     1970-01-10 17:01:16 UTC--1970-01-11 07:49:48 UTC 1970-01-10 17:01:16
# … with 4 more variables: end <dttm>, row <fct>, y_min <dbl>, y_max <dbl>

Longer version

This should produce plots that look like: Original Intervales

Updated intevals

# Here is the same code as above, but with more steps and some plots to illustrate the results.

dt <- tibble(
    id = c(1, 1, 1),
    start = c("1970-01-10 06:01:16", "1970-01-10 12:01:16", "1970-01-10 09:34:49"),
    end = c("1970-01-10 12:01:16", "1970-01-10 17:01:16", "1970-01-11 07:49:48"),
    value = c(1, 0, 3)
  ) 

dt <- dt %>%
  mutate(
    value = as_factor(value),
    start = lubridate::ymd_hms(start),  # convert to Date objects
    end = lubridate::ymd_hms(end),  # convert to Date objects
    date_range = lubridate::interval(start, end),  # create a 'Date Interval' for use later
    row = as_factor(row_number()),  # For plotting
    y_min = as.integer(value) - 0.4,  # For plotting
    y_max = as.integer(value) + 0.4,  # For plotting
  )

# Let's plot the initial data
ggplot(dt, aes(xmin=start, xmax=end, ymin=y_min, ymax = y_max, color=value)) + 
  geom_rect() + 
  labs(title = "original intervals")

ggsave("original_intervals.png")

all_dates <- dt %>% 
  select(start, end, value) %>%
  pivot_longer(!value, names_to = "date_type", values_to="date") %>%
  arrange(date) %>%
  select(date) %>%
  distinct()

dt <- dt %>% 
  rowwise() %>%
  mutate(
    bounded_dates = list(filter(all_dates, all_dates$date %within% date_range) %>% pull(date)),
    bounded_intervals = list(int_diff(bounded_dates))
)

new_intervals <- dt %>%
  select(value, bounded_intervals) %>%
  unnest(bounded_intervals)

# Add plotting-relevant columns
new_intervals <- new_intervals %>%
  mutate(
    start = int_start(bounded_intervals),
    end = int_end(bounded_intervals),
    row = as_factor(row_number()),
    y_min = as.integer(value) - 0.4,
    y_max = as.integer(value) + 0.4,
    value = as_factor(value)
  )

ggplot(new_intervals, aes(xmin=start, xmax=end, ymin=y_min, ymax = y_max, color=row)) + 
  geom_rect() + 
  labs(title = "new intervals")

ggsave("updated_intervals.png")

Upvotes: 2

Related Questions