Reputation: 168
I have an exact problem with this question Split overlapping intervals into non-overlapping intervals, within values of an identifier using R data.table, but with an additional issue that my intervals are date time instead of integers. My identifier is the id column. In the original data, I have overlapping time periods, and I want to split them into all non-overlapping periods for each id.
Have:
dt <- data.table(id = c(1, 1, 1),
start = c("1970-01-10 06:01:16", "1970-01-10 12:01:16", "1970-01-10 09:34:49"),
end = c("1970-01-10 12:01:16", "1970-01-10 17:01:16", "1970-01-11 07:49:48"),
value = c(1, 0, 3))
id start end value
1 1970-01-10 06:01:16 1970-01-10 12:01:16 1
1 1970-01-10 12:01:16 1970-01-10 17:01:16 0
1 1970-01-10 09:34:49 1970-01-11 07:49:48 3
Want:
id start end value
1 1970-01-10 06:01:16 1970-01-10 09:34:49 1
1 1970-01-10 09:34:49 1970-01-10 12:01:16 1
1 1970-01-10 12:01:16 1970-01-10 17:01:16 0
1 1970-01-10 09:34:49 1970-01-10 12:01:16 3
1 1970-01-10 12:01:16 1970-01-10 17:01:16 3
1 1970-01-10 17:01:16 1970-01-11 07:49:48 3
Appreciate any help!
Upvotes: 1
Views: 1180
Reputation: 42544
If I understand correctly, the OP wants to split the time interval of each row along the vector of all time instances in the dataset.
I was wondering if it would be possible to find a less convoluted and verbose approach. So, here are three variants:
foverlaps()
- this is my preferred optionlibrary(data.table)
cols <- c("start", "end")
dt[, (cols) := lapply(.SD, as.POSIXct), .SDcols = cols]
dt[, rn := .I]
ts <- dt[, CJ(rn, start = c(start, end), unique = TRUE)]
dt[ts, on = .(rn, start)][
, setnafill(.SD, "locf"), by = .(rn)][
start < end][
, end := shift(start, type = "lead", fill = last(end)), by = .(rn)][]
rn id start end value 1: 1 1 1970-01-10 06:01:16 1970-01-10 09:34:49 1 2: 1 1 1970-01-10 09:34:49 1970-01-10 12:01:16 1 3: 2 1 1970-01-10 12:01:16 1970-01-10 17:01:16 0 4: 3 1 1970-01-10 09:34:49 1970-01-10 12:01:16 3 5: 3 1 1970-01-10 12:01:16 1970-01-10 17:01:16 3 6: 3 1 1970-01-10 17:01:16 1970-01-11 07:49:48 3
CJ()
with all row numbersts
to dt
, i.e., take all common time instancesNA
s by last observation carried forwardend
columnlibrary(data.table)
cols <- c("start", "end")
dt[, (cols) := lapply(.SD, as.POSIXct), .SDcols = cols]
dt[, rn := .I]
ts <- dt[, .(unique(sort(c(start, end))))]
dt[, .SD[ts, on = .(start <= V1, end >= V1), nomatch = NULL, .SD][
, end := shift(start, type = "lead")][-.N], by = rn]
rn id start end value 1: 1 1 1970-01-10 06:01:16 1970-01-10 09:34:49 1 2: 1 1 1970-01-10 09:34:49 1970-01-10 12:01:16 1 3: 2 1 1970-01-10 12:01:16 1970-01-10 17:01:16 0 4: 3 1 1970-01-10 09:34:49 1970-01-10 12:01:16 3 5: 3 1 1970-01-10 12:01:16 1970-01-10 17:01:16 3 6: 3 1 1970-01-10 17:01:16 1970-01-11 07:49:48 3
V1
rn
,
.SD
with the common time instances,end
columnfoverlaps()
library(data.table)
cols <- c("start", "end")
dt[, (cols) := lapply(.SD, as.POSIXct), .SDcols = cols]
ts <- dt[, unique(sort(c(start, end)))]
foverlaps(dt, data.table(start = head(ts, -1L), end = tail(ts, -1L), key = "start,end"))[
start >= i.start & end <= i.end, -c("i.start", "i.end")]
start end id value 1: 1970-01-10 06:01:16 1970-01-10 09:34:49 1 1 2: 1970-01-10 09:34:49 1970-01-10 12:01:16 1 1 3: 1970-01-10 12:01:16 1970-01-10 17:01:16 1 0 4: 1970-01-10 09:34:49 1970-01-10 12:01:16 1 3 5: 1970-01-10 12:01:16 1970-01-10 17:01:16 1 3 6: 1970-01-10 17:01:16 1970-01-11 07:49:48 1 3
foverlaps()
. The second parameter is a keyed data.table which consists of sub-intervals [start, end]
created from the vector of common time instances ts
The second parameter contains all the sub-intervals
data.table(start = head(ts, -1L), end = tail(ts, -1L), key = "start,end")
start end 1: 1970-01-10 06:01:16 1970-01-10 09:34:49 2: 1970-01-10 09:34:49 1970-01-10 12:01:16 3: 1970-01-10 12:01:16 1970-01-10 17:01:16 4: 1970-01-10 17:01:16 1970-01-11 07:49:48
Upvotes: 1
Reputation: 2419
This is a data.table
version answer avoid using for
loop.
dt <- data.table(id = c(1, 1, 1),
start = as.POSIXct(c("1970-01-10 06:01:16", "1970-01-10 12:01:16", "1970-01-10 09:34:49")),
end = as.POSIXct(c("1970-01-10 12:01:16", "1970-01-10 17:01:16", "1970-01-11 07:49:48")),
value = c(1, 0, 3))
time_seq <- dt[,.(start=unique(sort(c(start,end))))]
dt[, `:=`(
start_pos = sapply(start, function(x) which(x == time_seq$start)),
end_pos = sapply(end, function(x) which(x == time_seq$start))
)][,num := end_pos - start_pos]
# same size with expected result
dt_desired <- dt[rep(seq(.N), num)][,order:=rowid(num)]
dt_final <- dt_desired[order == 1,
.(id,
start = time_seq[start_pos:(end_pos - 1), start],
end = time_seq[(start_pos + 1):end_pos, start],
value),
by = num]
Here is my very intuitive solution. The key is to create the unique ordered time series time_seq, which is the join of all start and end. Then to loop all the row to generate start and end from time_seq.
library(data.table)
dt <- data.table(id = c(1, 1, 1),
start = as.POSIXct(c("1970-01-10 06:01:16", "1970-01-10 12:01:16", "1970-01-10 09:34:49")),
end = as.POSIXct(c("1970-01-10 12:01:16", "1970-01-10 17:01:16", "1970-01-11 07:49:48")),
value = c(1, 0, 3))
# initialization
dt_tmp <- data.table()
tem <- data.table()
# unique full time series with order
time_seq <- data.table(start = unique(sort(c(dt$start,dt$end)) ))
for (i in 1:nrow(dt)){
# select date between start and end of the i-th row
tem <- time_seq[dt[i,start] <= start & start <= dt[i,end]]
len <- length(tem$start)
if(len > 2)
tmp <- data.table(id = dt[i,id],
start = tem[1:len-1,start],
end = tem[2:len,start],
value = dt[i,value])
else
tmp <- dt[i,]
dt_tmp <- rbind(dt_tmp,tmp)
}
dt_tmp
Upvotes: 2
Reputation: 2318
Ok, here we go.
We're going to solve your problem in five steps:
lubridate:int_diff()
to turn our ordered list of dates into a list of intervalsBelow, you'll find a short version that does just the above five steps, and then a longer version the plots the results with ggplot
.
library(tidyverse)
library(ggplot2)
# Short version
# I'm using a tibble instead of data.table; hope you don't mind
dt <- tibble(
id = c(1, 1, 1),
start = c("1970-01-10 06:01:16", "1970-01-10 12:01:16", "1970-01-10 09:34:49"),
end = c("1970-01-10 12:01:16", "1970-01-10 17:01:16", "1970-01-11 07:49:48"),
value = c(1, 0, 3)
)
# 1. Create 'interval' column from your original 'start' and 'end'
# 2. Create vector of all start + end dates and sort it
# 3. Using our 'interval' column, store an ordered list of all dates within that interval, per row
# 4. Use lubridate:int_diff() to turn our ordered list of dates into a list of intervals
# 5. Unnest our list-of-interval column into multiple rows: we're done!
# 1. Create 'interval' column
dt <- dt %>%
mutate(
start = lubridate::ymd_hms(start), # convert to Date objects
end = lubridate::ymd_hms(end), # convert to Date objects
date_range = lubridate::interval(start, end)
)
# 2. Create vector of all start + end dates and sort it
all_dates <- dt %>%
select(start, end, value) %>%
pivot_longer(!value, names_to = "date_type", values_to="date") %>%
arrange(date) %>%
select(date) %>%
distinct()
# Steps 3 and 4 in one mutate
dt <- dt %>%
rowwise() %>%
mutate(
# 3. Store an ordered list of all dates within that interval
bounded_dates = list(filter(all_dates, all_dates$date %within% date_range) %>% pull(date)),
# 4. convert list-of-dates column into list-of-intervals
bounded_intervals = list(int_diff(bounded_dates))
)
# 5. Unnest our list-of-interval column into multiple rows: we're done!
new_intervals <- dt %>%
select(value, bounded_intervals) %>%
unnest(bounded_intervals)
new_intervals
This should create:
# A tibble: 6 x 7
value bounded_intervals start
<fct> <Interval> <dttm>
1 1 1970-01-10 06:01:16 UTC--1970-01-10 09:34:49 UTC 1970-01-10 06:01:16
2 1 1970-01-10 09:34:49 UTC--1970-01-10 12:01:16 UTC 1970-01-10 09:34:49
3 0 1970-01-10 12:01:16 UTC--1970-01-10 17:01:16 UTC 1970-01-10 12:01:16
4 3 1970-01-10 09:34:49 UTC--1970-01-10 12:01:16 UTC 1970-01-10 09:34:49
5 3 1970-01-10 12:01:16 UTC--1970-01-10 17:01:16 UTC 1970-01-10 12:01:16
6 3 1970-01-10 17:01:16 UTC--1970-01-11 07:49:48 UTC 1970-01-10 17:01:16
# … with 4 more variables: end <dttm>, row <fct>, y_min <dbl>, y_max <dbl>
This should produce plots that look like:
# Here is the same code as above, but with more steps and some plots to illustrate the results.
dt <- tibble(
id = c(1, 1, 1),
start = c("1970-01-10 06:01:16", "1970-01-10 12:01:16", "1970-01-10 09:34:49"),
end = c("1970-01-10 12:01:16", "1970-01-10 17:01:16", "1970-01-11 07:49:48"),
value = c(1, 0, 3)
)
dt <- dt %>%
mutate(
value = as_factor(value),
start = lubridate::ymd_hms(start), # convert to Date objects
end = lubridate::ymd_hms(end), # convert to Date objects
date_range = lubridate::interval(start, end), # create a 'Date Interval' for use later
row = as_factor(row_number()), # For plotting
y_min = as.integer(value) - 0.4, # For plotting
y_max = as.integer(value) + 0.4, # For plotting
)
# Let's plot the initial data
ggplot(dt, aes(xmin=start, xmax=end, ymin=y_min, ymax = y_max, color=value)) +
geom_rect() +
labs(title = "original intervals")
ggsave("original_intervals.png")
all_dates <- dt %>%
select(start, end, value) %>%
pivot_longer(!value, names_to = "date_type", values_to="date") %>%
arrange(date) %>%
select(date) %>%
distinct()
dt <- dt %>%
rowwise() %>%
mutate(
bounded_dates = list(filter(all_dates, all_dates$date %within% date_range) %>% pull(date)),
bounded_intervals = list(int_diff(bounded_dates))
)
new_intervals <- dt %>%
select(value, bounded_intervals) %>%
unnest(bounded_intervals)
# Add plotting-relevant columns
new_intervals <- new_intervals %>%
mutate(
start = int_start(bounded_intervals),
end = int_end(bounded_intervals),
row = as_factor(row_number()),
y_min = as.integer(value) - 0.4,
y_max = as.integer(value) + 0.4,
value = as_factor(value)
)
ggplot(new_intervals, aes(xmin=start, xmax=end, ymin=y_min, ymax = y_max, color=row)) +
geom_rect() +
labs(title = "new intervals")
ggsave("updated_intervals.png")
Upvotes: 2