rkraft
rkraft

Reputation: 545

Fill incomplete time series in data.table

Question: In data.table is there any way to fill an incomplete date sequence with zeros? For instance, in the toy example some dates does not show up and I want to have a complete date sequence with y = 0 for these cases. Is there something like a forward fill?

Remark: Note I do not want to use merges where you create first the full date sequence and then merge it back to the initial data.table object (I think this is inefficient and rather slow).

library(data.table)

dt <- data.table(
  x = c("2020-03-28", "2020-03-29", "2020-03-31", "2020-04-05"),
  y = c(1, 5, 3, 70)
)

## Output:
            x  y
1: 2020-03-28  1
2: 2020-03-29  5
3: 2020-03-31  3
4: 2020-04-05 70

## Desired Output:
            x  y
1: 2020-03-28  1
2: 2020-03-29  5
3: 2020-03-30  0
4: 2020-03-31  3
5: 2020-04-01  0
6: 2020-04-02  0
7: 2020-04-03  0
8: 2020-04-04  0
9: 2020-04-05 70

Upvotes: 1

Views: 683

Answers (4)

B. Christian Kamgang
B. Christian Kamgang

Reputation: 6489

You could also try this:

dt[, x := as.IDate(x)]
dt[.(seq(min(x), max(x), 1)), .(y = fifelse(is.na(y), 0, y)), .EACHI, on = "x"]

#             x     y
# 1: 2020-03-28     1
# 2: 2020-03-29     5
# 3: 2020-03-30     0
# 4: 2020-03-31     3
# 5: 2020-04-01     0
# 6: 2020-04-02     0
# 7: 2020-04-03     0
# 8: 2020-04-04     0
# 9: 2020-04-05    70

Upvotes: 0

Darren Tsai
Darren Tsai

Reputation: 35554

Use CJ and tidyr::full_seq to create a join data table.

dt[, x := as.Date(x)]                               # convert x to the Date type
dt2 <- dt[CJ(x = tidyr::full_seq(x, 1)), on = .(x)] # create the full sequence
dt2[is.na(y), y := 0]                               # fill NAs with 0s
dt2

#             x  y
# 1: 2020-03-28  1
# 2: 2020-03-29  5
# 3: 2020-03-30  0
# 4: 2020-03-31  3
# 5: 2020-04-01  0
# 6: 2020-04-02  0
# 7: 2020-04-03  0
# 8: 2020-04-04  0
# 9: 2020-04-05 70

Upvotes: 2

MichaelChirico
MichaelChirico

Reputation: 34703

How about this?

# convert to data.table's integer date type
dt[ , x := as.IDate(x)]
# find the range of dates
date_bounds = range(dt$x)
# construct a sequence of all dates
#   NB: this will be integers as attributes are stripped
all_dates = date_bounds[1L]:date_bounds[2L]

# construct a table with the missing dates,
#   with y filled to 0
missing = data.table(
  # as.IDate uses the right origin for integer input
  x = as.IDate(setdiff(all_dates, dt$x)),
  y = 0
)

dt = rbind(dt, missing)
#             x     y
#        <IDat> <num>
# 1: 2020-03-28     1
# 2: 2020-03-29     5
# 3: 2020-03-31     3
# 4: 2020-04-05    70
# 5: 2020-03-30     0
# 6: 2020-04-01     0
# 7: 2020-04-02     0
# 8: 2020-04-03     0
# 9: 2020-04-04     0

Afterwards you can setorder(dt, x) if you want the dates to be in order

Upvotes: 4

Ronak Shah
Ronak Shah

Reputation: 388982

You could use complete from tidyr :

library(dplyr)
library(tidyr)

dt %>%
  mutate(x = as.Date(x)) %>%
  complete(x = seq(min(x), max(x), by = "day"), fill = list(y = 0))

#     x           y
#  <date>     <dbl>
#1 2020-03-28     1
#2 2020-03-29     5
#3 2020-03-30     0
#4 2020-03-31     3
#5 2020-04-01     0
#6 2020-04-02     0
#7 2020-04-03     0
#8 2020-04-04     0
#9 2020-04-05    70

Upvotes: 0

Related Questions