VJ D

Reputation: 177

Adding missing dates to a data table based on some conditions in R

I have a data table like this:

timestamp    Status
05-01-2020    0
06-01-2020    0
07-01-2020    1
08-01-2020    1
09-01-2020    1
11-01-2020    0
13-01-2020    1

If a Status of 1 and the following Status of 0 fall on non-consecutive days, I need to fill in the missing dates. Here the last value of Status on the 9th is 1, and it only becomes 0 on the 11th, so the 10th is missing in between. I need to add such dates to the existing data table (or create a new data table) with Status set to 1.

I am aware of this:

library(tidyverse)

complete(dt, Status, timestamp)

Expected output:

timestamp    Status
05-01-2020    0
06-01-2020    0
07-01-2020    1
08-01-2020    1
09-01-2020    1
10-01-2020    1
11-01-2020    0
13-01-2020    1

This should work for any number of missing days in between, but only for gaps where Status goes from 1 to 0, not from 0 to 1.

Upvotes: 0

Views: 565

Answers (5)

Uwe

Reputation: 42544

This is an interesting question. If I understand correctly, the OP asks to insert additional rows between two groups of contiguous Status values if the previous group had Status == 1 and the current group has Status == 0. In addition, I understand that filling in missing dates within a streak of Status == 1 is not requested.

So here are two different data.table approaches:

1. Grouping and prepending additional rows to each Status == 0 group

This solution borrows from Matt Dowle's answer to "Get the last row of a previous group in data.table" (a linked answer shows another use case).

It creates groups of contiguous streaks of 0/1 values in Status (using rleid()). For each group, it is checked whether it is required to insert rows. If so, the additional rows are prepended to the rows of the current group (using rbind()).
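
As a minimal sketch of the grouping step only, the following shows what rleid() returns for the Status values of the question's sample data. The variable names `status` and `grp` are chosen here for illustration.

```r
library(data.table)

# Status values from the question's sample data
status <- c(0L, 0L, 1L, 1L, 1L, 0L, 1L)

# rleid() gives every run of identical consecutive values its own id
grp <- rleid(status)
print(grp)
```

Each change of Status starts a new id, so a switch from 1 to 0 always coincides with the first row of a new group.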

library(data.table)
options(datatable.print.class = TRUE)
dt[, timestamp := as.IDate(timestamp, "%d-%m-%Y")]   # convert character dates to IDate (integer-backed dates)
dt[, grp := rleid(Status)]   # create groups of consecutive values of Status
dt[, new := ""]   # just for test & demonstration
pg <- first(dt)  # initialise storage of last row of previous group
dt[, {
  if (first(timestamp) - pg$timestamp > 1L & pg$Status == 1L) {
    # if there is a gap and Status switches from 1 to 0, then fill the gap
    add <- .(timestamp = seq(pg$timestamp + 1L, first(timestamp) - 1L, by = 1L), Status = 1L, new = "*")
  } else {
    # no gap to fill
    add <- .SD[0L]
  }
  pg <- last(.SD)   # remember last row
  rbind(add, .SD)   # prepend additional rows
}, by = grp][, grp := NULL][]   # remove grouping variable
     timestamp Status    new
        <IDat>  <int> <char>
 1: 2020-01-05      0       
 2: 2020-01-06      0       
 3: 2020-01-07      1       
 4: 2020-01-08      1       
 5: 2020-01-09      1       
 6: 2020-01-10      1      *
 7: 2020-01-11      0       
 8: 2020-01-13      1       
 9: 2020-01-14      0       
10: 2020-01-16      1       
11: 2020-01-17      1       
12: 2020-01-18      1      *
13: 2020-01-19      1      *
14: 2020-01-20      0

Note that an expanded dataset has been used (see below) to allow for more thorough testing. Also, the column new has been added only to show where rows have been inserted.

2. Identify gaps, create missing rows, append and re-order

This approach is different. It identifies the gaps which are to be filled, creates the missing rows, appends them to the original dataset, and re-orders the rows by timestamp:

library(data.table)
options(datatable.print.class = TRUE)
library(magrittr)   # piping used to improve readability
dt[, timestamp := as.IDate(timestamp, "%d-%m-%Y")] # convert character dates to IDate (integer-backed dates)
lapply(
  dt[, .I[timestamp - shift(timestamp, fill = first(timestamp)) > 1L & shift(Status) == 1 & Status == 0]], 
  function(i) dt[, .(timestamp = seq(timestamp[i - 1L] + 1L, timestamp[i] - 1L, by = 1L), Status = 1L)]
) %>% 
  c(list(dt)) %>% 
  rbindlist() %>% 
  .[order(timestamp)]
     timestamp Status
        <IDat>  <int>
 1: 2020-01-05      0
 2: 2020-01-06      0
 3: 2020-01-07      1
 4: 2020-01-08      1
 5: 2020-01-09      1
 6: 2020-01-10      1
 7: 2020-01-11      0
 8: 2020-01-13      1
 9: 2020-01-14      0
10: 2020-01-16      1
11: 2020-01-17      1
12: 2020-01-18      1
13: 2020-01-19      1
14: 2020-01-20      0

The expression

dt[, .I[timestamp - shift(timestamp, fill = first(timestamp)) > 1L & shift(Status) == 1 & Status == 0]]

identifies the gaps to be filled by returning the indices of the rows in the original dataset dt before which the additional rows need to be inserted.

[1]  6 11

So, the additional rows need to be inserted between rows 5 and 6, and between rows 10 and 11, respectively.
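
To make the condition easier to follow, here is a sketch that splits it into its two parts using the expanded dataset. The names `gap_days`, `drops` and `idx` are introduced here for illustration, and the day differences are computed with diff() on the integer day counts instead of the subtraction used above.

```r
library(data.table)

# expanded dataset from the Data section, already converted to IDate
dt <- data.table(
  timestamp = as.IDate(c("2020-01-05", "2020-01-06", "2020-01-07", "2020-01-08",
                         "2020-01-09", "2020-01-11", "2020-01-13", "2020-01-14",
                         "2020-01-16", "2020-01-17", "2020-01-20")),
  Status = c(0L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 0L)
)

# days elapsed since the previous row (0 for the first row)
gap_days <- c(0L, diff(as.integer(dt$timestamp)))

# TRUE where Status is 0 and the previous row had Status 1 (NA for the first row)
drops <- shift(dt$Status) == 1L & dt$Status == 0L

# rows that follow a gap AND mark a switch from 1 to 0
idx <- which(gap_days > 1L & drops)
print(idx)
```

Row 8 (2020-01-14) is a switch from 1 to 0 as well, but it directly follows 2020-01-13, so there is no gap and it is not selected.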

3. Data

The dataset has been expanded to allow for more thorough testing.

dt <- fread(
  "timestamp    Status
05-01-2020    0
06-01-2020    0
07-01-2020    1
08-01-2020    1
09-01-2020    1
11-01-2020    0
13-01-2020    1
14-01-2020    0
16-01-2020    1
17-01-2020    1
20-01-2020    0")

Note that all solutions posted so far assume that dt is ordered by increasing timestamp. If it is not, sort it first with

setorder(dt, timestamp)

Upvotes: 3

Peter

Reputation: 12699

I added some more rows to your data to cover the case of more than one missing day.


library(tidyr)
library(dplyr)
library(lubridate)

df %>%
  mutate(timestamp = as.Date(timestamp, format = "%d-%m-%Y"),
         # flag the last Status == 1 row before a gap that ends in Status == 0
         to_fill = case_when(Status == 1L & lead(Status) == 0L &
                               difftime(lead(timestamp), timestamp, units = "days") > 1 ~ 1,
                             TRUE ~ 0)) %>%
  # add a row (with NA in Status and to_fill) for every missing date
  complete(timestamp = seq.Date(min(timestamp), max(timestamp), by = "day")) %>%
  fill(to_fill) %>%
  mutate(Status = case_when(is.na(Status) & to_fill == 1 ~ 1L,
                            TRUE ~ Status)) %>%
  na.omit() %>%   # drop the added dates that should not be filled
  select(-to_fill)

#> # A tibble: 14 x 2
#>    timestamp  Status
#>    <date>      <int>
#>  1 2020-01-05      0
#>  2 2020-01-06      0
#>  3 2020-01-07      1
#>  4 2020-01-08      1
#>  5 2020-01-09      1
#>  6 2020-01-10      1
#>  7 2020-01-11      0
#>  8 2020-01-13      1
#>  9 2020-01-15      1
#> 10 2020-01-16      1
#> 11 2020-01-17      1
#> 12 2020-01-18      0
#> 13 2020-01-19      0
#> 14 2020-01-22      1

data

df <- data.frame(timestamp = c("05-01-2020", "06-01-2020", "07-01-2020", "08-01-2020", "09-01-2020", "11-01-2020", "13-01-2020", "15-01-2020", "18-01-2020", "19-01-2020", "22-01-2020"),
                 Status = c(0L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 0L, 1L ))

Created on 2020-05-20 by the reprex package (v0.3.0)

Upvotes: 0

chinsoon12

Reputation: 25225

An option is to use a rolling join to find the date before the closing zero and then fill in the missing dates for each contiguous streak of ones:

DT[Status==1L, nextzero := 
    DT[Status==0L][.SD, on=.(timestamp), roll=-Inf, x.timestamp - 1L]
]

ans <- rbindlist(list(
        DT[Status==1L & !is.na(nextzero), 
            .(timestamp=seq(min(timestamp), nextzero, by="1 day"), Status=1L),
            nextzero],
        DT[Status==0L | is.na(nextzero)]
    ), use.names=TRUE)[, nextzero := NULL]

setorder(ans, timestamp)[]
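
The following sketch isolates the rolling join so that its effect can be inspected on its own. It assumes the question's sample dates, split by hand into two small tables; the names `zeros`, `ones` and `next_zero` are illustrative and not part of the code above.

```r
library(data.table)

# dates with Status == 0 and Status == 1, taken from the question's sample data
zeros <- data.table(timestamp = as.IDate(c("2020-01-05", "2020-01-06", "2020-01-11")))
ones  <- data.table(timestamp = as.IDate(c("2020-01-07", "2020-01-08", "2020-01-09",
                                           "2020-01-13")))

# roll = -Inf carries the next observation backward: every date in `ones`
# is matched to the first date in `zeros` that is not earlier than it
next_zero <- zeros[ones, on = .(timestamp), roll = -Inf, x.timestamp]
print(next_zero)
```

The three dates of the streak should all be paired with 2020-01-11, while 2020-01-13 has no later zero and gets NA. That is why the code above subtracts one day to get the end of the fill range and keeps rows with a missing nextzero unchanged.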

data:

library(data.table)
DT <- fread("timestamp    Status
05-01-2020    0
06-01-2020    0
07-01-2020    1
08-01-2020    1
09-01-2020    1
11-01-2020    0
13-01-2020    1")
DT[, timestamp := as.IDate(timestamp, "%d-%m-%Y")]

Upvotes: 2

jay.sf

Reputation: 72919

You could create a temporary data frame that contains every consecutive date starting from the first one and whose Status column is 1.

dat$timestamp <- as.Date(dat$timestamp, format="%d-%m-%Y")  ## date format is needed
tmp <- data.frame(timestamp=seq(dat$timestamp[1], by="day", length.out=nrow(dat)),
           Status=1)

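Then use match to find the row where the diff of Status is -1 (a switch from 1 to 0) and rbind the following day from tmp. Note that match returns only the first such switch.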

dat <-
  rbind(dat, 
        tmp[match(dat$timestamp[match(-1, c(diff(dat$Status), NA))] + 1, tmp$timestamp), ])
dat[order(dat$timestamp), ]  
#     timestamp Status
# 1  2020-01-05      0
# 2  2020-01-06      0
# 3  2020-01-07      1
# 4  2020-01-08      1
# 5  2020-01-09      1
# 61 2020-01-10      1
# 6  2020-01-11      0
# 7  2020-01-13      1
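
If there can be several switches from 1 to 0, or gaps longer than one day, the same idea can be generalised in base R. This is only a sketch under those assumptions: the last row of the sample data (2020-01-16) is hypothetical and was added to create a two-day gap, and the names `i`, `fill` and `res` are illustrative.

```r
dat <- data.frame(
  timestamp = as.Date(c("2020-01-05", "2020-01-06", "2020-01-07", "2020-01-08",
                        "2020-01-09", "2020-01-11", "2020-01-13", "2020-01-16")),
  Status = c(0L, 0L, 1L, 1L, 1L, 0L, 1L, 0L)   # last row is a hypothetical addition
)
n <- nrow(dat)

# rows where Status is 1, the next row is 0, and at least one day is missing
i <- which(dat$Status[-n] == 1L & dat$Status[-1] == 0L &
             as.numeric(diff(dat$timestamp)) > 1)

# for each such row, create the dates strictly between it and the next row
fill <- do.call(rbind, lapply(i, function(k) {
  data.frame(timestamp = seq(dat$timestamp[k] + 1, dat$timestamp[k + 1] - 1, by = "day"),
             Status = 1L)
}))

res <- rbind(dat, fill)
res <- res[order(res$timestamp), ]
print(res)
```

Here 2020-01-10, 2020-01-14 and 2020-01-15 are added with Status 1, whereas 2020-01-12 is not, because that gap lies between a 0 and a 1.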

Data

dat <- read.table(text="timestamp    Status
05-01-2020    0
06-01-2020    0
07-01-2020    1
08-01-2020    1
09-01-2020    1
11-01-2020    0
13-01-2020    1", header=T)

Upvotes: 0

Ronak Shah

Reputation: 388982

We can filter the rows which we want to expand. A row is selected if its Status is 1 and the next row's Status is 0, OR if its Status is 0 and the previous row's Status is 1.

library(dplyr)
df$timestamp <- as.Date(df$timestamp, '%d-%m-%Y')

temp <- df %>% 
        filter(Status == 1 & lead(Status) == 0 | lag(Status) == 1 & Status == 0)

Then create groups of two rows in that data frame and expand each group, filling in the dates between the two rows and setting Status to 1. Once we have the expanded dataset, we can bind it with the remaining rows of the original dataset to get the complete dataset.

temp %>%  
   group_by(grp = rep(1:n(), each = 2, length.out = n())) %>%
   tidyr::complete(timestamp = seq(min(timestamp), max(timestamp), by = 'day'), 
                   fill = list(Status = 1)) %>%
   ungroup %>%
   select(-grp) %>%
   bind_rows(anti_join(df, temp)) %>%
   arrange(timestamp)


# A tibble: 8 x 2
#  timestamp  Status
#  <date>      <dbl>
#1 2020-01-05      0
#2 2020-01-06      0
#3 2020-01-07      1
#4 2020-01-08      1
#5 2020-01-09      1
#6 2020-01-10      1
#7 2020-01-11      0
#8 2020-01-13      1
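
The pairing step relies on the filtered rows always coming in consecutive pairs (the last 1 followed by the first 0). As a small illustration of the grouping vector only, with `n` standing in for the number of filtered rows:

```r
# grouping vector for 2, 4 and 6 filtered rows: each pair shares one id
grp2 <- rep(seq_len(2), each = 2, length.out = 2)
grp4 <- rep(seq_len(4), each = 2, length.out = 4)
grp6 <- rep(seq_len(6), each = 2, length.out = 6)
print(grp6)
```

With the question's data only two rows are selected (2020-01-09 and 2020-01-11), so there is a single group; with more switches from 1 to 0 each pair gets its own group and is completed independently.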

Upvotes: 0
