Boxuan
Boxuan

Reputation: 5167

R data.table replace "NULL" with `NA` when columns are factors

I pulled some data from a SQL database through ODBC and the columns are automatically set to factor. It is something like the following:

library(RODBC)
library(data.table)
data <- data.table(sqlQuery(channel, query))

My data looks like this, just with a lot more columns:

data <- data.table("C1"=as.factor(c(letters[1:4], "NULL", letters[5])),
                   "C2"=as.factor(c(rnorm(3), "NULL", rnorm(2))),
                   "C3"=as.factor(c(letters[1], "NULL", letters[2:4], "NULL")))
> data
     C1                 C2   C3
1:    a -0.190200079604691    a
2:    b  0.310548914832963 NULL
3:    c 0.0153099116493453    b
4:    d               NULL    c
5: NULL  0.157187027626419    d
6:    e  0.118537540781528 NULL
> str(data)
Classes ‘data.table’ and 'data.frame':  6 obs. of  3 variables:
 $ C1: Factor w/ 6 levels "a","b","c","d",..: 1 2 3 4 6 5
 $ C2: Factor w/ 6 levels "-0.190200079604691",..: 1 5 2 6 4 3
 $ C3: Factor w/ 5 levels "a","b","c","d",..: 1 5 2 3 4 5
 - attr(*, ".internal.selfref")=<externalptr> 

How can I replace the "NULL" with NA? Here I want R to treat these SQL "NULL" strings as missing values NA. I tried the following, but seems NA causes problems.

for (col in names(data)) {
  set(data, which(data[[col]]=="NULL"), col, NA)
}

> Error in set(data, which(data[[col]] == "NULL"), col, NA) : 
  Can't assign to column 'C1' (type 'factor') a value of type 'logical' (not character, factor, integer or numeric)

RODBC Solution

Thanks to @user20650's suggestion, you can control missing values from sqlQuery by doing data <- data.table(sqlQuery(channel, query, na.strings=c("NA", "NULL"))). However, it is still possible to have this problem if your data source is formatted incorrectly, so this is not an universal solution to the post.

Upvotes: 9

Views: 17682

Answers (2)

IRTFM
IRTFM

Reputation: 263499

This has the desired effect and is much more compact:

is.na(data) <- data == "NULL"

Note re: comment Q: the is.na function is quite different than the is.na<- function. The latter one used here is an assignment of a value of NA to items that are defined by the logical expression on hte RHS of the assigment operator. There is an is.na.data.frame-method but not an is.na[<-.dataframe-method. So not really sure if this is a purely by-reference strategy, since it is not implemented with the [.data.frame syntax. It may be using "is.na<-.default".

I think after noodling around a bit more that "is.na<-.default" (which is just {x[value] <- NA; x} ) so will end up dispatching this call to [<-.data.table so it probably will be done "by reference".

Upvotes: 16

Frank
Frank

Reputation: 66819

Here's one way:

data[,names(data):=lapply(.SD,function(x){
  z <- levels(x)
  z[z=="NULL"] <- NA
  `levels<-`(x,z)
})]

To see what happened, look at lapply(data,levels) and you'll see that "NULL" is gone.


(Thanks, @akrun:) A more concise and intuitive variant is available using the car package:

library(car)
data[,names(data):=lapply(.SD, recode, '"NULL"=NA')]

In the data.table world, it's usually possible to modify by reference. In this case, that looks like...

for (j in names(data)) setattr(data[[j]],"levels",{
  z <- levels(data[[j]])
  z[z=="NULL"] <- NA
  z
})

This avoids making a copy of the whole vector, as `levels<-` does.

Upvotes: 5

Related Questions