Reputation: 5167
I pulled some data from a SQL database through ODBC, and the columns are automatically set to factors. It is something like the following:
library(RODBC)
library(data.table)
data <- data.table(sqlQuery(channel, query))
My data looks like this, just with a lot more columns:
data <- data.table("C1"=as.factor(c(letters[1:4], "NULL", letters[5])),
"C2"=as.factor(c(rnorm(3), "NULL", rnorm(2))),
"C3"=as.factor(c(letters[1], "NULL", letters[2:4], "NULL")))
> data
C1 C2 C3
1: a -0.190200079604691 a
2: b 0.310548914832963 NULL
3: c 0.0153099116493453 b
4: d NULL c
5: NULL 0.157187027626419 d
6: e 0.118537540781528 NULL
> str(data)
Classes ‘data.table’ and 'data.frame': 6 obs. of 3 variables:
$ C1: Factor w/ 6 levels "a","b","c","d",..: 1 2 3 4 6 5
$ C2: Factor w/ 6 levels "-0.190200079604691",..: 1 5 2 6 4 3
$ C3: Factor w/ 5 levels "a","b","c","d",..: 1 5 2 3 4 5
- attr(*, ".internal.selfref")=<externalptr>
How can I replace the "NULL" entries with NA? I want R to treat these SQL "NULL" strings as missing values (NA). I tried the following, but it seems NA causes problems.
for (col in names(data)) {
  set(data, which(data[[col]] == "NULL"), col, NA)
}
> Error in set(data, which(data[[col]] == "NULL"), col, NA) :
Can't assign to column 'C1' (type 'factor') a value of type 'logical' (not character, factor, integer or numeric)
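The error arises because a bare NA is of type logical, while set() expects a value whose type matches the column. As a rough sketch of a workaround (assuming you don't mind dropping the factor class), converting each column to character first lets a typed NA_character_ through:
for (col in names(data)) {
  set(data, j = col, value = as.character(data[[col]]))                        # factor -> character, by reference
  set(data, i = which(data[[col]] == "NULL"), j = col, value = NA_character_)  # blank out the "NULL" strings
}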
RODBC Solution
Thanks to @user20650's suggestion, you can control missing values from sqlQuery by calling data <- data.table(sqlQuery(channel, query, na.strings = c("NA", "NULL"))). However, you can still run into this problem if your data source is formatted incorrectly, so this is not a universal solution.
Upvotes: 9
Views: 17682
Reputation: 263499
This has the desired effect and is much more compact:
is.na(data) <- data == "NULL"
Note re: comment Q: the is.na function is quite different from the is.na<- function. The latter, used here, assigns NA to the items identified by the logical expression on the RHS of the assignment operator. There is an is.na.data.frame method but no is.na<-.data.frame method, so I'm not really sure whether this is a purely by-reference strategy, since it is not implemented with the [.data.frame syntax. It may be using is.na<-.default.
After noodling around a bit more, I think is.na<-.default (which is just {x[value] <- NA; x}) will end up dispatching this call to [<-.data.table, so it probably will be done "by reference".
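To illustrate the point, here is a rough sketch of what that assignment form expands to, shown on a plain data.frame copy (where [<- with a logical matrix is the documented behaviour):
df <- as.data.frame(data)   # plain data.frame copy, purely for illustration
idx <- df == "NULL"         # logical matrix: TRUE wherever a cell holds the string "NULL"
df[idx] <- NA               # what is.na<-.default's `x[value] <- NA` amounts to here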
Upvotes: 16
Reputation: 66819
Here's one way:
data[, names(data) := lapply(.SD, function(x) {
  z <- levels(x)
  z[z == "NULL"] <- NA
  `levels<-`(x, z)
})]
To see what happened, look at lapply(data,levels)
and you'll see that "NULL"
is gone.
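For example, a quick sanity check after running the above (a sketch only; colSums is just one way to count the new missing values):
lapply(data, levels)      # "NULL" should no longer appear in any column's levels
colSums(is.na(data))      # the former "NULL" cells now count as missing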
(Thanks, @akrun:) A more concise and intuitive variant is available using the car
package:
library(car)
data[,names(data):=lapply(.SD, recode, '"NULL"=NA')]
In the data.table
world, it's usually possible to modify by reference. In this case, that looks like...
for (j in names(data)) setattr(data[[j]], "levels", {
  z <- levels(data[[j]])
  z[z == "NULL"] <- NA
  z
})
This avoids making a copy of the whole vector, as `levels<-`
does.
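If you want to convince yourself that the columns were not copied, one sketch (assuming data.table's address() helper is available in your version; base R's tracemem() works similarly) is to compare the column addresses before and after running the loop above:
before <- vapply(data, data.table::address, character(1))
# ... run the setattr() loop shown above ...
after <- vapply(data, data.table::address, character(1))
identical(before, after)   # TRUE when the columns were modified in place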
Upvotes: 5