NA replace function in R

I'm trying to replace the NAs in a matrix, mat, with zeros, using mat[is.na(mat)] <- 0. This works fine on a matrix of 94531 observations of 18946 variables or smaller, but when I try it on a matrix of 112039 observations of 22752 variables, R shows an error:

Error in if (!nreplace) return(x) : missing value where TRUE/FALSE needed
In addition: Warning message:
In sum(i, na.rm = TRUE) : integer overflow - use sum(as.numeric(.))

I don't know what I'm doing wrong and I don't understand the error.

Here is an example of the structure of my data.

Small data.matrix (made from the real data source):

> str(mat)
Classes 'data.table' and 'data.frame':  94531 obs. of  18946 variables:
 $ 6316506: num  1 0 NA NA NA NA NA NA NA NA ...
 $ 6794602: num  0 1 NA NA NA NA NA 0 0 0 ...
 $ 1008667: num  NA NA 0 1 0 NA NA 0 0 0 ...
 $ 6312454: num  NA NA 1 0 0 NA NA 0 0 0 ...
 $ 8009082: num  NA NA 0 0 1 NA NA NA NA NA ...
 $ 1023293: num  NA NA NA NA NA 1 NA NA NA NA ...
 $ 6740421: num  NA NA NA NA NA 1 NA 0 0 0 ...
 $ 6777805: num  NA NA NA NA NA NA 1 NA NA NA ...
 $ 1000558: num  NA NA NA NA NA NA NA 0 0 0 ...
 $ 1001682: num  NA NA NA NA NA NA NA 0 0 0 ...

The bigger one looks exactly the same.

Another question:

Is there a way to use rbindlist(data, fill=T) and fill with zeros instead of NAs?

Upvotes: 2

Views: 820

Answers (1)

lmo

Reputation: 38500

With a large data.table, the set function is usually the way to go for replacement within variables.
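As for the error itself, it looks consistent with an integer overflow in the matrix-style assignment. As far as I can tell, the data.frame replacement method counts the TRUE cells of is.na(mat) with sum(), which returns an integer for logical input, and the larger table has more cells than a 32-bit integer can hold. The sketch below only checks that arithmetic and reproduces the two messages on toy values; it does not touch the real data, and the variable names are mine.

```r
# Cell counts of the two tables from the question (computed as doubles).
small_cells <- 94531 * 18946     # 1790984326
big_cells   <- 112039 * 22752    # 2549111328

int_max <- .Machine$integer.max  # 2147483647

small_cells <= int_max  # TRUE: a count of NA cells always fits in an integer
big_cells > int_max     # TRUE: a count of NA cells can overflow

# Integer overflow yields NA (with a warning) ...
overflowed <- suppressWarnings(int_max + 1L)
is.na(overflowed)       # TRUE

# ... and if() on an NA condition raises the error seen in the question.
msg <- tryCatch(if (!overflowed) "nothing to replace",
                error = function(e) conditionMessage(e))
msg
```

Working column by column avoids this, since each column has far fewer rows than the integer limit.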

In this application, you can get your desired outcome in two steps.

  1. Find the locations of NAs for each variable and return a list.
  2. Use data.table's set function to replace the values.

I constructed a data.table as a reproducible example.

library(data.table)
set.seed(1234)
dt <- data.table(matrix(sample(c(NA, rnorm(4)), 50, replace=TRUE), 10))

This looks like

dt
            V1         V2         V3         V4         V5
 1:  1.0844412         NA -2.3456977 -2.3456977 -1.2070657
 2:  0.2774292 -1.2070657         NA -2.3456977  1.0844412
 3:  1.0844412 -1.2070657  0.2774292  0.2774292         NA
 4:  0.2774292 -1.2070657 -1.2070657  1.0844412 -1.2070657
 5: -1.2070657         NA -1.2070657 -1.2070657  1.0844412
 6: -2.3456977         NA  0.2774292  1.0844412  0.2774292
 7: -1.2070657 -1.2070657         NA -1.2070657         NA
 8: -2.3456977 -2.3456977  1.0844412  0.2774292  0.2774292
 9: -1.2070657  0.2774292 -1.2070657  1.0844412  0.2774292
10: -1.2070657 -2.3456977 -1.2070657  0.2774292  1.0844412

The first step is to find the NAs for each column.

myNAs <- lapply(dt, function(x) which(is.na(x)))
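To make the shape of that list concrete, here is a small sketch on a hypothetical three-column table (toy and toyNAs are names I made up for illustration). Each element holds the row positions of the NAs in one column, and a column without NAs gives a zero-length vector.

```r
library(data.table)

# Hypothetical toy table: column c has no missing values.
toy <- data.table(a = c(1, NA, 3), b = c(NA, NA, 2), c = c(1, 2, 3))

# One integer vector of NA row positions per column.
toyNAs <- lapply(toy, function(x) which(is.na(x)))

toyNAs$a           # 2
toyNAs$b           # 1 2
length(toyNAs$c)   # 0, so there is nothing to replace in this column
```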

Next, use a for loop to iterate over the columns and fill in the NA values with the efficient set function, first checking with if that the column actually contains missing values.

for(j in seq_along(dt)) if(length(myNAs[[j]]) > 0) set(dt, myNAs[[j]], j, 0)

set performs the replacement "in place" (without any copies), so following this operation, the data.table dt has the former NAs replaced with 0s.

dt
            V1         V2         V3         V4         V5
 1:  1.0844412  0.0000000 -2.3456977 -2.3456977 -1.2070657
 2:  0.2774292 -1.2070657  0.0000000 -2.3456977  1.0844412
 3:  1.0844412 -1.2070657  0.2774292  0.2774292  0.0000000
 4:  0.2774292 -1.2070657 -1.2070657  1.0844412 -1.2070657
 5: -1.2070657  0.0000000 -1.2070657 -1.2070657  1.0844412
 6: -2.3456977  0.0000000  0.2774292  1.0844412  0.2774292
 7: -1.2070657 -1.2070657  0.0000000 -1.2070657  0.0000000
 8: -2.3456977 -2.3456977  1.0844412  0.2774292  0.2774292
 9: -1.2070657  0.2774292 -1.2070657  1.0844412  0.2774292
10: -1.2070657 -2.3456977 -1.2070657  0.2774292  1.0844412
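
On the rbindlist follow-up: as far as I know, rbindlist(fill=TRUE) always pads missing columns with NA and has no option for a different fill value, so one approach is to bind first and then run the same set loop over the result. The sketch below wraps the loop in a helper; replace_na is a hypothetical name of mine, not a data.table function, and the two input tables are toy data.

```r
library(data.table)

# Hypothetical helper (not part of data.table): replace NAs with `value`
# by reference, one column at a time, using set().
replace_na <- function(dt, value = 0) {
  for (j in seq_along(dt)) {
    rows <- which(is.na(dt[[j]]))
    if (length(rows) > 0) set(dt, i = rows, j = j, value = value)
  }
  invisible(dt)
}

# Two toy tables with only partly overlapping columns.
parts <- list(
  data.table(a = c(1, 0), b = c(0, 1)),
  data.table(b = 1, c = 0)
)

combined <- rbindlist(parts, fill = TRUE)  # missing columns are padded with NA
replace_na(combined)                       # modifies combined in place

anyNA(combined)  # FALSE
```

Because set works by reference, the helper does not need its return value assigned back to combined.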

Upvotes: 6
