Reputation: 21675
I have a data.table like so
dt <- data.table(x=as.Date(c("2014-1-1", "2015-1-1", "2016-1-1")), y=as.Date(c(NA, "2015-6-1", NA)))
dt
x y
1: 2014-01-01 <NA>
2: 2015-01-01 2015-06-01
3: 2016-01-01 <NA>
I want to add a column z
which is equal to y where y is not NA, and x otherwise.
dt[, z:=ifelse(is.na(y), x, y)]
dt
x y z
1: 2014-01-01 <NA> 16071
2: 2015-01-01 2015-06-01 16587
3: 2016-01-01 <NA> 16801
But for some reason the above statement casts z to numeric. If I try to convert it to a date with as.Date
I get an error
dt[, z:=as.Date(ifelse(is.na(y), x, y))]
Error in as.Date.numeric(ifelse(is.na(y), x, y)) : 'origin' must be supplied
What gives and how do I accomplish what I'm trying to do?
Upvotes: 1
Views: 27447
Reputation: 42574
This old question has been viewed over ten thousand times now.
Although it has an accepted answer I feel the question deserves
data.table
solution, Date
fails with ifelse()
and replace()
approach returns the wrong results.data.table
approachWith data.table
, ifelse()
and replace()
can be written as two chained assignment operations where the second one uses subsetting:
dt[, z := y][is.na(z), z := x][]
x y z 1: 2014-01-01 <NA> 2014-01-01 2: 2015-01-01 2015-06-01 2015-06-01 3: 2016-01-01 <NA> 2016-01-01
The first assignment operation creates a new column z
by copying the y
column. The second assignment operation modifies z
in place by copying the contents of the x
column only to those rows where z
is NA
.
Alternatively, we can take a copy of x
first and the replace the z
values with the non-NA
y
values:
dt <- copy(dt_orig) # use a fresh copy of dt
dt[, z := x][!is.na(y), z := y][]
The latter might be more efficient if there are many NA
values in y
.
replace()
approachesFrank has suggested to use replace()
instead of ifelse()
which was picked up by C8H10N4O2 in an edit of his answer. Unfortunately, both codes not only generate warnings but simply return the wrong result:
dt <- copy(dt_orig) # use a fresh copy of dt
# C8H10N4O2's version
dt[, z := replace(y, is.na(y), x)][]
dt <- copy(dt_orig) # use a fresh copy of dt
# Frank's version
dt[, z := replace(y, which(is.na(y)), x)][]
x y z 1: 2014-01-01 <NA> 2014-01-01 2: 2015-01-01 2015-06-01 2015-06-01 3: 2016-01-01 <NA> 2015-01-01 Warning message: In NextMethod(.Generic) : number of items to replace is not a multiple of replacement length
The value of z
in row 3 has been copied from x
in row 2 which is wrong. Instead, it should have been copied from row 3.
What has happened here? The help page on replace(x, list, values)
says
replace
replaces the values inx
with indices given inlist
by those given invalues
.
In our example, list
gets the row indices 1, 3
while values
gets 2014-01-01, 2015-01-01, 2016-01-01
. The different lengths is the reason for the warning message. And it is obvious that the second index in list
which is row 3 is replaced by the second value in values
which is 2015-01-01
.
The correct use of replace()
requires to subset x
as well:
dt <- copy(dt_orig) # use a fresh copy of dt
dt[, z := replace(y, is.na(y), x[is.na(y)])][]
which yields
x y z 1: 2014-01-01 <NA> 2014-01-01 2: 2015-01-01 2015-06-01 2015-06-01 3: 2016-01-01 <NA> 2016-01-01
without any warning.
Date
fails with ifelse()
The help page on ifelse(test, yes, no)
has a long Warning section which starts
The mode of the result may depend on the value of
test
[...], and the class attribute [...] of the result is taken fromtest
and may be inappropriate for the values selected fromyes
andno
.Sometimes it is better to use a construction such as
(tmp <- yes; tmp[!test] <- no[!test]; tmp)
Applying this suggestion to our example
dt <- copy(dt_orig) # use a fresh copy of dt
dt[, z := {tmp <- x; tmp[!is.na(y)] <- y[!is.na(y)]; tmp}][]
we do get
x y z 1: 2014-01-01 <NA> 2014-01-01 2: 2015-01-01 2015-06-01 2015-06-01 3: 2016-01-01 <NA> 2016-01-01
library(data.table) # version 1.11.4 used
dt_orig <-data.table(x = as.Date(c("2014-1-1", "2015-1-1", "2016-1-1")),
y = as.Date(c(NA, "2015-6-1", NA)))
As there are 5 different approaches available now, I was wondering what the fastest method is. The run time may depend on the number of rows but also on the share of NA
values in y
.
So, the press()
function from the bench
package is used to investigate the impact of the two parameters on benchmarks results.
bm <- bench::press(
n_rows = c(100, 1E4, 1E6),
na_share = c(0.1, 0.5, 0.9),
{
dt_bm <- data.table(x = as.Date("1970-01-01") + seq_len(n_rows),
y = as.Date("2970-01-01") + seq_len(n_rows))
set.seed(1L)
dt_bm[sample(seq_len(n_rows), na_share * n_rows), y := NA]
bench::mark(
ifelse = copy(dt_bm)[, z := as.Date(ifelse(is.na(y), x, y), origin="1970-01-01")][],
replace = copy(dt_bm)[, z := replace(y, is.na(y), x[is.na(y)])][],
tmp = copy(dt_bm)[, z := {tmp <- x; tmp[!is.na(y)] <- y[!is.na(y)]; tmp}][],
copy_y = copy(dt_bm)[, z := y][is.na(z), z := x][],
copy_x = copy(dt_bm)[, z := x][!is.na(y), z := y][]
)
}
)
library(ggplot2)
autoplot(bm) + theme_bw()
Upvotes: 8
Reputation: 19015
When R looks at dates as integers, its origin is January 1, 1970.
https://stats.idre.ucla.edu/r/faq/how-does-r-handle-date-values/
dt[, z:=as.Date(ifelse(is.na(y), x, y), origin="1970-01-01")]
update: as Frank suggests, this also seems to work and does not seem to require un-coercion: dt[, z:=replace(y, is.na(y), x)]
. It throws a warning so use w/ caution.
Upvotes: 4
Reputation: 5169
dt[, z:=as.Date(ifelse(is.na(y), x, y),origin="1970-01-01")]
dt
x y z
1: 2014-01-01 <NA> 2014-01-01
2: 2015-01-01 2015-06-01 2015-06-01
3: 2016-01-01 <NA> 2016-01-01
Upvotes: 2