Reputation: 3554

Replacing "NA" (NA string) with NA inplace data.table

I have this dummy dataset:

abc <- data.table(a = c("NA", "bc", "x"), b = c(1, 2, 3), c = c("n", "NA", "NA"))

where I am trying to replace "NA" with standard NA; in place using data.table. I tried:

for(i in names(abc)) (abc[which(abc[[i]] == "NA"), i := NA])
for(i in names(abc)) (abc[which(abc[[i]] == "NA"), i := NA_character_])
for(i in names(abc)) (set(abc, which(abc[[i]] == "NA"), i, NA))

However still with this I get:

abc$a 
"NA" "bc" "x"

What am I missing?

EDIT: I tried @frank answer in this question which makes use of type.convert(). (Thanks frank; didn't know such obscure albeit useful function) In documentation of type.convert() it is mentioned: "This is principally a helper function for read.table." so I wanted to test this thoroughly. This function comes with small side effect when you have a complete column filled with "NA" (NA string). In such case type.convert() is converting column to logical. For such case abc will be:

abc <- data.table(a = c("NA", "bc", "x"), b = c(1, 2, 3), c = c("n", "NA", "NA"), d = c("NA", "NA", "NA"))

EDIT2: To summerize code present in original question:

for(i in names(abc)) (set(abc, which(abc[[i]] == "NA"), i, NA))

works fine but only in current latest version of data.table (> 1.11.4). So if one is facing this problem then its better to update data.table and use this code than type.convert()

Upvotes: 0

Answers (3)

Uwe

Reputation: 42592

Here are two other approaches:

Subsetting

library(data.table)
abcd <- data.table(a = c("NA", "bc", "x"), b = c(1, 2, 3),
                   c = c("n", "NA", "NA"), d = c("NA", "NA", "NA"))

for (col in names(abcd)) abcd[get(col) == "NA", (col) := NA]
abcd[]

      a b    c    d
1: <NA> 1    n <NA>
2:   bc 2 <NA> <NA>
3:    x 3 <NA> <NA>

Update while joining

Here, data.table is rather strict concerning variable type.

abcd <- data.table(a = c("NA", "bc", "x"), b = c(1, 2, 3),
                   c = c("n", "NA", "NA"), d = c("NA", "NA", "NA"))

for (col in names(abcd)) 
  if (is.character(abcd[[col]])) 
    abcd[.("NA", NA_character_), on = paste0(col, "==V1"), (col) := V2][]
abcd

      a b    c    d
1: <NA> 1    n <NA>
2:   bc 2 <NA> <NA>
3:    x 3 <NA> <NA>

Upvotes: 0

Frank

Reputation: 66819

I'd do...

chcols = names(abc)[sapply(abc, is.character)]
abc[, (chcols) := lapply(.SD, type.convert, as.is=TRUE), .SDcols=chcols]

which yields

> str(abc)
Classes ‘data.table’ and 'data.frame':  3 obs. of  3 variables:
 $ a: chr  NA "bc" "x"
 $ b: num  1 2 3
 $ c: chr  "n" NA NA
 - attr(*, ".internal.selfref")=<externalptr>

Your DT[, i :=] code did not work because it creates a column literally named "i"; and your set code does work already, as @AdamSampson pointed out. (Note: OP upgraded from data.table 1.10.4-3 to 1.11.4 before this was the case on their comp.)

so I wanted to test this thoroughly. This function comes with small side effect when you have a complete column filled with "NA" (NA string). In such case type.convert() is converting column to logical.

Oh right. Your original approach is safer against this problem:

# op's new example
abc <- data.table(a = c("NA", "bc", "x"), b = c(1, 2, 3), c = c("n", "NA", "NA"), d = c("NA", "NA", "NA"))

# op's original code
for(i in names(abc)) 
  set(abc, which(abc[[i]] == "NA"), i, NA)

Side note: NA has type logical; and usually data.table would warn when assigning values of an incongruent type to a column, but I guess they wrote in an exception for NAs:

DT = data.table(x = 1:2)
DT[1, x := NA]
# no problem, even though x is int and NA is logi

DT = data.table(x = 1:2)
DT[1, x := TRUE]
# Warning message:
# In `[.data.table`(DT, 1, `:=`(x, TRUE)) :
#   Coerced 'logical' RHS to 'integer' to match the column's type. Either change the target column ['x'] to 'logical' first (by creating a new 'logical' vector length 2 (nrows of entire table) and assign that; i.e. 'replace' column), or coerce RHS to 'integer' (e.g. 1L, NA_[real|integer]_, as.*, etc) to make your intent clear and for speed. Or, set the column type correctly up front when you create the table and stick to it, please.

Upvotes: 5

Adam Sampson

Reputation: 2021

I really liked Frank's response, but want to add to it because it assumes you're only performing the change for character vectors. I'm also going to try to include some info on "why" it works.

To replace all NA you could do something like:

chcols = names(abc)
abc[,(chcols) := lapply(.SD, function(x) ifelse(x == "NA",NA,x)),.SDcols = chcols]

Let's breakdown what we are doing here.

We are looking at every row in abc (because there is nothing before the first comma).

After the next comma is the columns. Let's break that down.

We are putting the results into all of the columns listed in chcols. The (chcols) tells the data.table method to evaluate the vector of names held in the chcols object. If you left off the parentheses and used chcols it would try to store the results in a column called chcols instead of using the column names you want.

.SD is returning a data.table with the results of every column listed in .SDcols (in my case it is returning all columns...). But we want to evaluate a single column at a time. So we use lapply to apply a function to every column in .SD one at a time.

You can use any function that will return the correct values. Frank used type.convert. I'm using an anonymous function that evaluates an ifelse statement. I used ifelse because it evaluates and returns an entire vector/column.

You already know how to use a := to replace values in place.

After the next column you either put the by information or you put additional options. We will add additional options in the form of .SDcols.

We need to put a .SDcols = chcols to tell data.table which columns to include in .SD. My code is evaluating all columns, so if you left off .SDcols my code would still work. But it's a bad habit to leave this column off because you can lose time in the future if you make a change to only evaluate certain columns. Frank's example only evaluated columns that were of the character class for instance.

Upvotes: 1

Replacing &quot;NA&quot; (NA string) with NA inplace data.table

Answers (3)

Subsetting

Update while joining

Related Questions

Replacing "NA" (NA string) with NA inplace data.table