Ryan

Reputation: 1068

Replacing values by index with data.table syntax

assume we have data.table d1 with 6 rows:

library(data.table)
d1 <- data.table(v1 = c(1,2,3,4,5,6), v2 = c(5,5,5,5,5,5))

we add a column to d1 called test, and fill it with NA

d1$test <- NA

the external vector rows gives the index of rows we want to fill with values contained in vals

rows <- c(5,6)
vals <- c(6,3)

how do you do this in data.table syntax? I have not been able to figure this out from the documentation.

it seems like this should work, but it does not:

d1[rows, test := vals]

the following warning is returned: Warning: 6.000000 (type 'double') at RHS position 1 taken as TRUE when assigning to type 'logical' (column 3 named 'test')

This is my desired outcome:

data.table(v1 = c(1,2,3,4,5,6), v2 = c(5,5,5,5,5,5), test = c(NA,NA,NA,NA,6,3))

Upvotes: 0

Views: 656

Answers (1)

r2evans

Reputation: 160447

Let's walk through this:

library(data.table)
d1 <- data.table(v1 = c(1,2,3,4,5,6), v2 = c(5,5,5,5,5,5))
d1$test <- NA
rows <- c(5,6)
vals <- c(6,3)
d1[rows, test := vals]
# Warning in `[.data.table`(d1, rows, `:=`(test, vals)) :
#   6.000000 (type 'double') at RHS position 1 taken as TRUE when assigning to type 'logical' (column 3 named 'test')
class(d1$test)
# [1] "logical"
class(vals)
# [1] "numeric"

R can be quite "sloppy" in general, allowing one to coerce values from one class to another. Typically, this is from integer to floating point, sometimes from number to string, sometimes logical to number, etc. R does this freely, at times unexpectedly, and often silently. For instance,

13 > "2"
# [1] FALSE

The LHS is of class numeric, the RHS character. Because the classes differ, R silently converts 13 to "13" and then compares the two as strings. String comparison is lexicographic, that is, character by character: it first compares the "1" with the "2", determines that "1" sorts before "2", and stops there, since no later character can change the result. Neither the fact that the numeric comparison would give a different answer nor the fact that the RHS has no more characters to compare (lengths themselves are not compared) matters.
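To make the coercion visible, here is a small base R sketch that writes the same comparison three ways. The variable names are only for illustration.

```r
# Mixed types: the numeric 13 is coerced to the string "13" before comparing
mixed <- 13 > "2"

# The same comparison written explicitly as strings
as_strings <- "13" > "2"

# Converting the string to a number first gives the numeric comparison
as_numbers <- 13 > as.numeric("2")

print(c(mixed = mixed, as_strings = as_strings, as_numbers = as_numbers))
#      mixed as_strings as_numbers
#      FALSE      FALSE       TRUE
```

Being explicit with as.numeric (or as.character) is the kind of deliberate conversion that data.table is nudging the user toward.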

So R can be quite sloppy about this; not all languages are this permissive (most are not, in my experience), and this can be risky in unsupervised (automated) situations. It often produces unexpected results. Because of this, many (including devs of data.table and dplyr, to name two) "encourage" (force) the user to be explicit about class coercion.

As a side note: R has at least 8 different classes of NA, and all of them look like NA:

str(list(NA, NA_integer_, NA_real_, NA_character_, NA_complex_, 
         Sys.Date()[NA], Sys.time()[NA], as.POSIXlt(Sys.time())[NA]))
# List of 8
#  $ : logi NA
#  $ : int NA
#  $ : num NA
#  $ : chr NA
#  $ : cplx NA
#  $ : Date[1:1], format: NA
#  $ : POSIXct[1:1], format: NA
#  $ : POSIXlt[1:1], format: NA
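A quick way to see which kind of NA you are holding is typeof. This is a base R sketch, with a hypothetical vector named col standing in for the test column:

```r
# A bare NA is logical, which is why a column filled with NA is logical too
col <- rep(NA, 3)
typeof(col)
# [1] "logical"

# The typed constants carry their own storage type
typeof(NA_real_)
# [1] "double"
typeof(NA_character_)
# [1] "character"

# Combining a logical NA with a number promotes the whole vector
typeof(c(NA, 1.5))
# [1] "double"
```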

There are a few ways to fix that warning.

  1. Instantiate the test column as a "real" (numeric, floating-point) version of NA:

    # starting with a fresh `d1` without `test` defined
    d1$test <- NA_real_
    d1[rows, test := vals]  # works, no warning
    
  2. Instantiate the test column programmatically, matching the class of vals without using the literal NA_real_:

    # starting with a fresh `d1` without `test` defined
    d1$test <- vals[1][NA]
    d1[rows, test := vals]  # works, no warning
    
  3. Convert the existing test column in its entirety (not subsetted) to the desired class:

    d1$test <- NA                   # this one is class logical
    d1[, test := as.numeric(test)]  # converts from NA to NA_real_
    d1[rows, test := vals]          # works, no warning
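One more variant, offered as a sketch rather than as the recommended fix: if test does not need to exist beforehand, you can skip the NA placeholder and let := create the column during the sub-assignment. As far as I know, data.table then gives the new column the type of vals and leaves the unselected rows as NA.

```r
library(data.table)

# starting with a fresh `d1` without `test` defined
d1 <- data.table(v1 = c(1, 2, 3, 4, 5, 6), v2 = c(5, 5, 5, 5, 5, 5))
rows <- c(5, 6)
vals <- c(6, 3)

# `test` does not exist yet, so := creates it while assigning into `rows`;
# rows not listed in `rows` are filled with NA of the matching type
d1[rows, test := vals]

class(d1$test)
# [1] "numeric"
d1$test
# [1] NA NA NA NA  6  3
```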
    

Things that work but are still being sloppy:

  1. replace allows us to do this, but it is silently internally coercing from logical to numeric:

    d1$test <- NA  # logical class
    # with a logical index, vals are assigned in row order, so this assumes rows is sorted ascending
    d1[, test := replace(test, .I %in% rows, vals)]
    

    This works because the internals of replace are simple:

    function (x, list, values) 
    {
        x[list] <- values
        x
    }
    

    The reassignment to x[list] causes R to coerce the entire vector from logical to numeric, and it returns the whole vector at once. In data.table, assigning to the whole column at once allows this, since it is a common operation to change the class of a column. As a side note, some might be tempted to use replace to fix things here. It works, but it further demonstrates the sloppiness of R here (and more so in ifelse which, while convenient, is broken in a few ways).

  2. base::ifelse doesn't work here out of the box because we'd need vals to be the same length as the number of rows in d1. Even if that were the case, though, ifelse also silently coerces the class of one or the other. Imagine these scenarios:

    ifelse(c(TRUE, TRUE), pi, "pi")
    # [1] 3.141593 3.141593
    ifelse(c(TRUE, FALSE), pi, "pi")
    # [1] "3.14159265358979" "pi"              
    

    The moment one of the conditions is false in this case, the whole result changes from numeric to character, and there was no message or warning to that effect. It is because of this that data.table::fifelse (and dplyr::if_else) will fail preemptively:

    fifelse(c(TRUE, TRUE), pi, "pi")
    # Error in fifelse(c(TRUE, TRUE), pi, "pi") : 
    #   'yes' is of type double but 'no' is of type character. Please make sure that both arguments have the same type.
    

    (There are other issues with ifelse, not just this, caveat emptor.)
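To tie the two points above together, here is a minimal sketch (the variable names are illustrative only) showing the silent promotion inside replace, and fifelse succeeding once both branches are explicitly given the same type:

```r
library(data.table)

# replace() on a logical vector: the sub-assignment promotes the whole vector
x <- c(NA, NA, NA)
typeof(x)
# [1] "logical"
y <- replace(x, 2, 1.5)
typeof(y)
# [1] "double"

# fifelse() requires `yes` and `no` to share a type, so convert explicitly
z <- fifelse(c(TRUE, FALSE), as.character(pi), "pi")
typeof(z)
# [1] "character"
```

The difference is only in who decides: with replace and ifelse, R picks the resulting type for you, whereas fifelse makes you state it.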

Upvotes: 1
