Reputation: 21675

Strange behavior involving dates - "origin must be supplied"

I have a data.table like so

dt <- data.table(x=as.Date(c("2014-1-1", "2015-1-1", "2016-1-1")), y=as.Date(c(NA, "2015-6-1", NA)))
dt
            x          y
1: 2014-01-01       <NA>
2: 2015-01-01 2015-06-01
3: 2016-01-01       <NA>

I want to add a column z which is equal to y where y is not NA, and x otherwise.

dt[, z:=ifelse(is.na(y), x, y)]
dt
            x          y     z
1: 2014-01-01       <NA> 16071
2: 2015-01-01 2015-06-01 16587
3: 2016-01-01       <NA> 16801

But for some reason the above statement casts z to numeric. If I try to convert it to a date with as.Date I get an error

dt[, z:=as.Date(ifelse(is.na(y), x, y))]
Error in as.Date.numeric(ifelse(is.na(y), x, y)) : 'origin' must be supplied

What gives and how do I accomplish what I'm trying to do?

Upvotes: 1

Answers (3)

Uwe

Reputation: 42574

This old question has been viewed over ten thousand times now.

Although it has an accepted answer I feel the question deserves

a genuine data.table solution,
an explanation why Date fails with ifelse() and
why the replace() approach returns the wrong results.

`data.table` approach

With data.table, ifelse() and replace() can be written as two chained assignment operations where the second one uses subsetting:

dt[, z := y][is.na(z), z := x][]

            x          y          z
1: 2014-01-01       <NA> 2014-01-01
2: 2015-01-01 2015-06-01 2015-06-01
3: 2016-01-01       <NA> 2016-01-01

The first assignment operation creates a new column z by copying the y column. The second assignment operation modifies z in place by copying the contents of the x column only to those rows where z is NA.

Alternatively, we can take a copy of x first and the replace the z values with the non-NA y values:

dt <- copy(dt_orig)   # use a fresh copy of dt
dt[, z := x][!is.na(y), z := y][]

The latter might be more efficient if there are many NA values in y.

Errors in the `replace()` approaches

Frank has suggested to use replace() instead of ifelse() which was picked up by C8H10N4O2 in an edit of his answer. Unfortunately, both codes not only generate warnings but simply return the wrong result:

dt <- copy(dt_orig)   # use a fresh copy of dt
# C8H10N4O2's version 
dt[, z := replace(y, is.na(y), x)][]

dt <- copy(dt_orig)   # use a fresh copy of dt
# Frank's version
dt[, z := replace(y, which(is.na(y)), x)][]

            x          y          z
1: 2014-01-01       <NA> 2014-01-01
2: 2015-01-01 2015-06-01 2015-06-01
3: 2016-01-01       <NA> 2015-01-01
Warning message:
In NextMethod(.Generic) :
  number of items to replace is not a multiple of replacement length

The value of z in row 3 has been copied from x in row 2 which is wrong. Instead, it should have been copied from row 3.

What has happened here? The help page on replace(x, list, values) says

replace replaces the values in x with indices given in list by those given in values.

In our example, list gets the row indices 1, 3 while values gets 2014-01-01, 2015-01-01, 2016-01-01. The different lengths is the reason for the warning message. And it is obvious that the second index in list which is row 3 is replaced by the second value in values which is 2015-01-01.

The correct use of replace() requires to subset x as well:

dt <- copy(dt_orig)   # use a fresh copy of dt
dt[, z := replace(y, is.na(y), x[is.na(y)])][]

which yields

            x          y          z
1: 2014-01-01       <NA> 2014-01-01
2: 2015-01-01 2015-06-01 2015-06-01
3: 2016-01-01       <NA> 2016-01-01

without any warning.

Why `Date` fails with `ifelse()`

The help page on ifelse(test, yes, no) has a long Warning section which starts

The mode of the result may depend on the value of test [...], and the class attribute [...] of the result is taken from test and may be inappropriate for the values selected from yes and no.

Sometimes it is better to use a construction such as
(tmp <- yes; tmp[!test] <- no[!test]; tmp)

Applying this suggestion to our example

dt <- copy(dt_orig)   # use a fresh copy of dt
dt[, z := {tmp <- x; tmp[!is.na(y)] <- y[!is.na(y)]; tmp}][]

we do get

            x          y          z
1: 2014-01-01       <NA> 2014-01-01
2: 2015-01-01 2015-06-01 2015-06-01
3: 2016-01-01       <NA> 2016-01-01

Data

library(data.table)   # version 1.11.4 used
dt_orig <-data.table(x = as.Date(c("2014-1-1", "2015-1-1", "2016-1-1")), 
                y = as.Date(c(NA, "2015-6-1", NA)))

Benchmark

As there are 5 different approaches available now, I was wondering what the fastest method is. The run time may depend on the number of rows but also on the share of NA values in y.

So, the press() function from the bench package is used to investigate the impact of the two parameters on benchmarks results.

bm <- bench::press(
  n_rows = c(100, 1E4, 1E6),
  na_share = c(0.1, 0.5, 0.9),
  {
    dt_bm <- data.table(x = as.Date("1970-01-01") + seq_len(n_rows),
                        y = as.Date("2970-01-01") + seq_len(n_rows))
    set.seed(1L)
    dt_bm[sample(seq_len(n_rows), na_share * n_rows), y := NA]
    bench::mark(
      ifelse = copy(dt_bm)[, z := as.Date(ifelse(is.na(y), x, y), origin="1970-01-01")][],
      replace = copy(dt_bm)[, z := replace(y, is.na(y), x[is.na(y)])][],
      tmp = copy(dt_bm)[, z := {tmp <- x; tmp[!is.na(y)] <- y[!is.na(y)]; tmp}][],
      copy_y = copy(dt_bm)[, z := y][is.na(z), z := x][],
      copy_x = copy(dt_bm)[, z := x][!is.na(y), z := y][]
    )
  }
)

library(ggplot2)
autoplot(bm) + theme_bw()

Upvotes: 8

C8H10N4O2

Reputation: 19015

When R looks at dates as integers, its origin is January 1, 1970.

https://stats.idre.ucla.edu/r/faq/how-does-r-handle-date-values/

dt[, z:=as.Date(ifelse(is.na(y), x, y), origin="1970-01-01")]

update: as Frank suggests, this also seems to work and does not seem to require un-coercion: dt[, z:=replace(y, is.na(y), x)]. It throws a warning so use w/ caution.

Upvotes: 4

Alexey Ferapontov

Reputation: 5169

dt[, z:=as.Date(ifelse(is.na(y), x, y),origin="1970-01-01")]
dt
            x          y          z
1: 2014-01-01       <NA> 2014-01-01
2: 2015-01-01 2015-06-01 2015-06-01
3: 2016-01-01       <NA> 2016-01-01

Upvotes: 2

Strange behavior involving dates - &quot;origin must be supplied&quot;

Answers (3)

data.table approach

Errors in the replace() approaches

Why Date fails with ifelse()