Reputation: 1095
I have a dataset imported from a MongoDB database as a data.table, where some of the columns are formatted as lists and contain some NULL values. The NULL values were causing me issues when trying to fill a column in another data.table by reference to the first table, as the destination column was not in list format (and therefore can't hold NULL values).
I found a solution below, which works fine for now, but my test dataset is only 6 records and I'm wondering if this would struggle when working with larger datasets or if there is a more efficient way to do this (in data.table)?
Here is some example data:
library(data.table)
dt <- data.table(id = c(1,2,3), age = list(12, NULL, 15), sex = list("F", "M", NULL))
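To illustrate why the NULLs are awkward: an atomic (non-list) column simply cannot hold a NULL, so NULLs silently vanish when list values are flattened. A minimal sketch of the symptom (illustrative only, not the original fill-by-reference code):
# unlist() drops the NULL, so 3 list elements collapse to 2 atomic values
unlist(dt$age)
# [1] 12 15
# the same thing happens when building an atomic vector directly
length(c(12, NULL, 15))
# [1] 2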
And here is the solution I applied:
# Function to change NULL to NA in a data.table with lists:
null2na <- function(dtcol){
  nowna = lapply(dtcol, function(x) ifelse(is.null(x), NA_real_, x))
  return(nowna)
}
# Apply the function to the data.table to replace NULLs with NAs:
dt[, c(names(dt)) := lapply(.SD, null2na), .SDcols = names(dt)]
Upvotes: 0
Views: 1073
Reputation: 1095
My toy example is too small to compare timings, but combining both solutions suggested by @B. Christian Kamgang and @Ronak Shah works well for me:
# Function to replace NULL with NA in list columns:
null2na <- function(dtcol){
  fullcol = replace(dtcol, lengths(dtcol) == 0L, NA)
  return(fullcol)
}
# Apply the function to the dataset:
dt[, names(dt) := lapply(.SD, null2na)]
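A quick sanity check on the example data (an illustrative check, not part of the original post) confirms that no zero-length (NULL) elements remain afterwards:
# every element of every column now has length 1 (atomic columns trivially so)
any(sapply(dt, function(col) any(lengths(col) == 0L)))
# [1] FALSE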
Two things I found advantageous with this approach (thanks to both respondents for suggesting):
Avoiding the use of base R ifelse, dplyr::if_else and data.table::fifelse; base R ifelse converts all columns to a list unless you specify them beforehand, and the dplyr and data.table versions of ifelse, while they respect the original column classes, don't work in this scenario because NA is interpreted as differing in type from the other values in the list.
The lengths(dtcol) == 0L test targets only the list elements that are NULL and doesn't touch the other columns or values. This means it is not necessary to specify the subset of list columns beforehand, as it inherently deals only with those.
I've gone with replace() rather than subsetting dtcol in the function, as I think with larger datasets the former might be slightly faster (but I have yet to test that).
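If one wanted to test that, a rough timing comparison on a larger synthetic dataset might look like the sketch below (not from the original post; it assumes the microbenchmark package is installed, and the helper names and data sizes are made up for illustration):
library(data.table)
library(microbenchmark)
# Synthetic data with list columns that contain some NULLs
n <- 1e5
big <- data.table(
  id  = seq_len(n),
  age = lapply(seq_len(n), function(i) if (i %% 10 == 0) NULL else i),
  sex = lapply(seq_len(n), function(i) if (i %% 7 == 0) NULL else "F")
)
# lengths() is 0 exactly for the NULL elements, so both variants flag the same positions
via_replace <- function(col) replace(col, lengths(col) == 0L, NA)
via_subset  <- function(col) { col[lengths(col) == 0L] <- NA; col }
microbenchmark(
  replace = copy(big)[, names(big) := lapply(.SD, via_replace)],
  subset  = copy(big)[, names(big) := lapply(.SD, via_subset)],
  times = 10
)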
Upvotes: 0
Reputation: 6529
Here is another way to solve your problem:
cols <- names(dt)[sapply(dt, is.list)] # get names of list columns
dt[, (cols) := lapply(.SD, function(x) replace(x, lengths(x)==0L, NA)), .SDcols=cols]
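With the example dt from the question, cols contains only the two list columns, so the atomic id column is left untouched:
cols
# [1] "age" "sex"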
Upvotes: 1
Reputation: 389355
You can save one lapply call by using the lengths function.
library(data.table)
null2na <- function(dtcol){
  dtcol[lengths(dtcol) == 0] <- NA
  return(dtcol)
}
dt[, names(dt) := lapply(.SD, null2na)]
dt
#   id age sex
#1:  1  12   F
#2:  2  NA   M
#3:  3  15  NA
The age and sex columns are still lists. If you want them as simple atomic vectors, return unlist(dtcol) from the function.
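For example, a variant along those lines (a sketch; the name null2na_unlist is just for illustration) would also flatten the list columns to atomic vectors:
null2na_unlist <- function(dtcol){
  dtcol[lengths(dtcol) == 0] <- NA
  unlist(dtcol)   # flatten to an atomic vector (NULLs already replaced by NA)
}
dt[, names(dt) := lapply(.SD, null2na_unlist)]
sapply(dt, class)   # id and age become numeric, sex becomes character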
Upvotes: 2