Reputation: 177
I have a data table like this:
timestamp Status
05-01-2020 0
06-01-2020 0
07-01-2020 1
08-01-2020 1
09-01-2020 1
11-01-2020 0
13-01-2020 1
If Status 1 and Status 0 appear on two different days with a gap between them, I need to fill in the missing dates. Here, the last value of Status on the 9th is 1, and it only becomes 0 on the 11th, so the 10th is missing in between. I need to add these dates to the existing data table, or create a new data table, with Status set to 1.
I am aware of this:
library(tidyverse)
complete(dt, status, timestamp)
Expected output:
timestamp Status
05-01-2020 0
06-01-2020 0
07-01-2020 1
08-01-2020 1
09-01-2020 1
10-01-2020 1
11-01-2020 0
13-01-2020 1
This should repeat for any number of days in between, but only for gaps between a 1 and a following 0, not between a 0 and a following 1.
Upvotes: 0
Views: 565
Reputation: 42544
This is an interesting question. If I understand correctly, the OP asks to insert additional rows between two groups of contiguous values of Status if the previous group had Status == 1 and the current group has Status == 0. In addition, I understand that filling up missing dates within a streak of Status == 1 is not requested.
So here are two different data.table approaches:
Status == 0 group
This solution borrows from Matt Dowle's answer to Get the last row of a previous group in data.table (see here for another use case).
It creates groups of contiguous streaks of 0/1 values in Status (using rleid()). For each group, it checks whether rows need to be inserted. If so, the additional rows are prepended to the rows of the current group (using rbind()).
library(data.table)
options(datatable.print.class = TRUE)
dt[, timestamp := as.IDate(timestamp, "%d-%m-%Y")] # coerce character date to IDate
dt[, grp := rleid(Status)] # create groups of consecutive values of Status
dt[, new := ""] # just for test & demonstration
pg <- first(dt) # initialise storage of last row of previous group
dt[, {
  if (first(timestamp) - pg$timestamp > 1L & pg$Status == 1L) {
    # if there is a gap and Status switches from 1 to 0 then fill the gap
    add <- .(timestamp = seq(pg$timestamp + 1L, first(timestamp) - 1L, by = 1L), Status = 1L, new = "*")
  } else {
    # no gap to fill
    add <- .SD[0L]
  }
  pg <- last(.SD) # remember last row
  rbind(add, .SD) # prepend additional rows
}, by = grp][, grp := NULL][] # remove grouping variable
     timestamp Status    new
        <IDat>  <int> <char>
 1: 2020-01-05      0
 2: 2020-01-06      0
 3: 2020-01-07      1
 4: 2020-01-08      1
 5: 2020-01-09      1
 6: 2020-01-10      1      *
 7: 2020-01-11      0
 8: 2020-01-13      1
 9: 2020-01-14      0
10: 2020-01-16      1
11: 2020-01-17      1
12: 2020-01-18      1      *
13: 2020-01-19      1      *
14: 2020-01-20      0
Note that an enhanced dataset has been used (see below) to allow for more thorough testing. Also, the column new has been added just to demonstrate where the rows have been inserted.
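As a small illustration of the grouping step described above, here is a minimal sketch of what rleid() returns. The vector `status` is a hypothetical toy input, not the OP's data.

```r
library(data.table)

# Hypothetical toy vector, used only to illustrate rleid()
status <- c(0L, 0L, 1L, 1L, 1L, 0L, 1L)

# rleid() increments its counter every time the value changes,
# so each streak of identical values gets its own group id
grp <- rleid(status)
print(grp)
# [1] 1 1 2 2 2 3 4
```

Grouping by this id is what lets the code above look at one streak at a time and compare its first row with the last row of the previous streak.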
This approach is different. It identifies the gaps which are to be filled, creates the missing rows, appends them to the original dataset, and re-orders the rows by timestamp:
library(data.table)
options(datatable.print.class = TRUE)
library(magrittr) # piping used to improve readability
dt[, timestamp := as.IDate(timestamp, "%d-%m-%Y")] # coerce character date to IDate
lapply(
  dt[, .I[timestamp - shift(timestamp, fill = first(timestamp)) > 1L & shift(Status) == 1 & Status == 0]],
  function(i) dt[, .(timestamp = seq(timestamp[i - 1L] + 1L, timestamp[i] - 1L, by = 1L), Status = 1L)]
) %>%
  c(list(dt)) %>%
  rbindlist() %>%
  .[order(timestamp)]
     timestamp Status
        <IDat>  <int>
 1: 2020-01-05      0
 2: 2020-01-06      0
 3: 2020-01-07      1
 4: 2020-01-08      1
 5: 2020-01-09      1
 6: 2020-01-10      1
 7: 2020-01-11      0
 8: 2020-01-13      1
 9: 2020-01-14      0
10: 2020-01-16      1
11: 2020-01-17      1
12: 2020-01-18      1
13: 2020-01-19      1
14: 2020-01-20      0
The expression
dt[, .I[timestamp - shift(timestamp, fill = first(timestamp)) > 1L & shift(Status) == 1 & Status == 0]]
identifies the gaps to be filled by returning the indices of the rows in the original dataset dt before which the additional rows need to be inserted.
[1] 6 11
So, the additional rows need to be inserted between rows 5 and 6, and between rows 10 and 11, respectively.
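To make the three parts of that condition easier to follow, here is a minimal sketch that computes them as separate columns. The table `toy` and the helper columns `gap` and `prev_status` are hypothetical names introduced only for this illustration.

```r
library(data.table)

# Hypothetical four-row table, already ordered by timestamp
toy <- data.table(
  timestamp = as.IDate(c("2020-01-09", "2020-01-11", "2020-01-13", "2020-01-14")),
  Status    = c(1L, 0L, 1L, 0L)
)

# Days since the previous row; the first row is compared with itself, giving 0
toy[, gap := as.integer(timestamp - shift(timestamp, fill = first(timestamp)))]
# Status of the previous row (NA for the first row)
toy[, prev_status := shift(Status)]

# A row qualifies if there is a gap, the previous Status was 1 and the current one is 0
idx <- toy[, which(gap > 1L & prev_status == 1L & Status == 0L)]
print(idx)
# [1] 2
```

Row 3 also follows a two-day gap, but there Status switches from 0 to 1, so it is not selected. Row 4 switches from 1 to 0 but follows the previous row by only one day.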
The dataset has been expanded to allow for more thorough testing.
dt <- fread(
"timestamp Status
05-01-2020 0
06-01-2020 0
07-01-2020 1
08-01-2020 1
09-01-2020 1
11-01-2020 0
13-01-2020 1
14-01-2020 0
16-01-2020 1
17-01-2020 1
20-01-2020 0")
Note that all solutions posted so far assume that dt is ordered by increasing timestamp. If not, this can be achieved by
setorder(dt, timestamp)
Upvotes: 3
Reputation: 12699
I added some more rows to your data to cover the case of more than one missing day.
library(tidyr)
library(dplyr)
library(lubridate)
df %>%
  mutate(timestamp = as.Date(timestamp, format = "%d-%m-%Y"),
         to_fill = case_when(Status == 1L & lead(Status) == 0L &
                               difftime(lead(timestamp), timestamp, units = "days") > 1 ~ 1,
                             TRUE ~ 0)) %>%
  complete(timestamp = seq.Date(min(timestamp), max(timestamp), by = "day")) %>%
  fill(to_fill) %>%
  mutate(Status = case_when(is.na(Status) & to_fill == 1 ~ 1L,
                            TRUE ~ Status)) %>%
  na.omit() %>%
  select(-to_fill)
#> # A tibble: 14 x 2
#> timestamp Status
#> <date> <int>
#> 1 2020-01-05 0
#> 2 2020-01-06 0
#> 3 2020-01-07 1
#> 4 2020-01-08 1
#> 5 2020-01-09 1
#> 6 2020-01-10 1
#> 7 2020-01-11 0
#> 8 2020-01-13 1
#> 9 2020-01-15 1
#> 10 2020-01-16 1
#> 11 2020-01-17 1
#> 12 2020-01-18 0
#> 13 2020-01-19 0
#> 14 2020-01-22 1
df <- data.frame(timestamp = c("05-01-2020", "06-01-2020", "07-01-2020", "08-01-2020", "09-01-2020", "11-01-2020", "13-01-2020", "15-01-2020", "18-01-2020", "19-01-2020", "22-01-2020"),
Status = c(0L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 0L, 1L ))
Created on 2020-05-20 by the reprex package (v0.3.0)
Upvotes: 0
Reputation: 25225
An option using rolling join to find the date before the closing zero and then fill up the missing dates for each continuous set of ones:
DT[Status==1L, nextzero :=
    DT[Status==0L][.SD, on=.(timestamp), roll=-Inf, x.timestamp - 1L]
]
ans <- rbindlist(list(
    DT[Status==1L & !is.na(nextzero),
        .(timestamp=seq(min(timestamp), nextzero, by="1 day"), Status=1L),
        nextzero],
    DT[Status==0L | is.na(nextzero)]
), use.names=TRUE)[, nextzero := NULL]
setorder(ans, timestamp)[]
data:
library(data.table)
DT <- fread("timestamp Status
05-01-2020 0
06-01-2020 0
07-01-2020 1
08-01-2020 1
09-01-2020 1
11-01-2020 0
13-01-2020 1")
DT[, timestamp := as.IDate(timestamp, "%d-%m-%Y")]
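For readers unfamiliar with backward rolling joins, here is a minimal sketch of what roll = -Inf does on its own. The tables `ones` and `zeros` and the column `next_zero` are hypothetical names used only for this illustration; the answer above additionally subtracts one day from the match.

```r
library(data.table)

# Hypothetical lookup tables: dates with Status 1 and dates with Status 0
ones  <- data.table(timestamp = as.IDate(c("2020-01-07", "2020-01-09", "2020-01-13")))
zeros <- data.table(timestamp = as.IDate(c("2020-01-06", "2020-01-11")))

# For each row of `ones`, roll = -Inf picks the nearest row of `zeros`
# at or after that timestamp (next observation carried backward).
# x.timestamp is the matched timestamp taken from `zeros`.
nz <- zeros[ones, on = .(timestamp), roll = -Inf, x.timestamp]
ones[, next_zero := nz]
print(ones)
```

The first two dates are matched to 2020-01-11, while 2020-01-13 has no later zero and gets NA, which is why the answer treats rows with is.na(nextzero) separately.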
Upvotes: 2
Reputation: 72919
You could create a temporary data frame that is complete with respect to dates and whose Status column is 1.
dat$timestamp <- as.Date(dat$timestamp, format="%d-%m-%Y") ## date format is needed
tmp <- data.frame(timestamp=seq(dat$timestamp[1], by="day", length.out=nrow(dat)),
Status=1)
Then use match to rbind the row where the lagged diff is -1.
dat <-
rbind(dat,
tmp[match(dat$timestamp[match(-1, c(diff(dat$Status), NA))] + 1, tmp$timestamp), ])
dat[order(dat$timestamp), ]
# timestamp Status
# 1 2020-01-05 0
# 2 2020-01-06 0
# 3 2020-01-07 1
# 4 2020-01-08 1
# 5 2020-01-09 1
# 61 2020-01-10 1
# 6 2020-01-11 0
# 7 2020-01-13 1
Data
dat <- read.table(text="timestamp Status
05-01-2020 0
06-01-2020 0
07-01-2020 1
08-01-2020 1
09-01-2020 1
11-01-2020 0
13-01-2020 1", header=T)
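One point worth keeping in mind: match() returns only the position of the first match, so the code above handles a single 1-to-0 transition, and the `+ 1` adds a single day. A minimal sketch of the difference between match() and which(), on a hypothetical toy vector `status`:

```r
# Hypothetical toy vector with two 1 -> 0 transitions
status <- c(0L, 1L, 0L, 1L, 1L, 0L)

# diff() is -1 exactly where a 1 is followed by a 0
d <- diff(status)

first_drop <- match(-1, d)   # position of the first transition only
all_drops  <- which(d == -1) # positions of all transitions

print(first_drop)
# [1] 2
print(all_drops)
# [1] 2 5
```

Extending the answer to several transitions would presumably start from which() rather than match(), but that is an assumption about how one might adapt it, not part of the original answer.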
Upvotes: 0
Reputation: 388982
We can filter the rows which we want to expand. A row is selected if its Status is 1 and the next row's Status is 0, OR if its Status is 0 and the previous row's Status is 1.
library(dplyr)
df$timestamp <- as.Date(df$timestamp, '%d-%m-%Y')
temp <- df %>%
filter(Status == 1 & lead(Status) == 0 | lag(Status) == 1 & Status == 0)
Then create groups of two rows in that dataframe and expand them, filling in the dates between them and setting Status to 1. Once we have the expanded dataset, we can bind it with the original dataset to get the complete dataset.
temp %>%
  group_by(grp = rep(1:n(), each = 2, length.out = n())) %>%
  tidyr::complete(timestamp = seq(min(timestamp), max(timestamp), by = 'day'),
                  fill = list(Status = 1)) %>%
  ungroup %>%
  select(-grp) %>%
  bind_rows(anti_join(df, temp)) %>%
  arrange(timestamp)
# A tibble: 8 x 2
# timestamp Status
# <date> <dbl>
#1 2020-01-05 0
#2 2020-01-06 0
#3 2020-01-07 1
#4 2020-01-08 1
#5 2020-01-09 1
#6 2020-01-10 1
#7 2020-01-11 0
#8 2020-01-13 1
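To see which rows the filter condition in this answer keeps, here is a minimal sketch on a hypothetical toy data frame `toy`. It evaluates the same logical expression outside of filter().

```r
library(dplyr)

# Hypothetical toy data, used only to inspect the filter condition
toy <- data.frame(Status = c(0L, 1L, 1L, 0L, 1L))

# Same expression as in filter(): & binds tighter than |
keep <- with(toy,
  Status == 1 & dplyr::lead(Status) == 0 |
    dplyr::lag(Status) == 1 & Status == 0
)

print(which(keep))
# [1] 3 4
```

The first and last elements of `keep` are NA because lag() and lead() have no neighbour there, and filter() drops rows where the condition is NA. The kept rows come in pairs (the last 1 and the following 0), which is what the grouping by two rows relies on.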
Upvotes: 0