Reputation: 359
I'm trying to replace the NAs in a matrix, mat, with zeros. I'm using mat[is.na(mat)] <- 0. With a matrix of 94531 observations of 18946 variables or smaller it works fine, but when I try it on a matrix of 112039 observations of 22752 variables, R shows an error:
Error in if (!nreplace) return(x) : missing value where TRUE/FALSE needed
In addition: Warning message:
In sum(i, na.rm = TRUE) : integer overflow - use sum(as.numeric(.))
I don't know what I'm doing wrong and I don't understand the error.
Here is an example of the structure of my data.
Small data.matrix (made from the real data source):
> str(mat)
Classes 'data.table' and 'data.frame': 94531 obs. of 18946 variables:
$ 6316506: num 1 0 NA NA NA NA NA NA NA NA ...
$ 6794602: num 0 1 NA NA NA NA NA 0 0 0 ...
$ 1008667: num NA NA 0 1 0 NA NA 0 0 0 ...
$ 6312454: num NA NA 1 0 0 NA NA 0 0 0 ...
$ 8009082: num NA NA 0 0 1 NA NA NA NA NA ...
$ 1023293: num NA NA NA NA NA 1 NA NA NA NA ...
$ 6740421: num NA NA NA NA NA 1 NA 0 0 0 ...
$ 6777805: num NA NA NA NA NA NA 1 NA NA NA ...
$ 1000558: num NA NA NA NA NA NA NA 0 0 0 ...
$ 1001682: num NA NA NA NA NA NA NA 0 0 0 ...
The bigger one looks exactly the same.
Another question: is there a way to use rbindlist(data, fill=T) and fill with zeros instead of NAs?
Upvotes: 2
Views: 820
Reputation: 38500
With a large data.table, the set function is usually the way to go for replacement within variables.
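For context on the error itself, here is a rough sketch of the arithmetic, assuming the assignment counts the TRUE cells of is.na(mat) with an integer sum, as the sum(i, na.rm = TRUE) in the warning suggests. The larger table has more cells than an R integer can hold, so the count can overflow to NA when most cells are missing, and if() then fails on that NA. The variable names below are made up for illustration.

```r
# Cell counts for the two tables in the question (computed as doubles).
small_cells <- 94531 * 18946      # 1790984326
big_cells   <- 112039 * 22752     # 2549111328
int_max     <- .Machine$integer.max  # 2147483647

small_cells <= int_max  # TRUE: a count of NAs here always fits in an integer
big_cells > int_max     # TRUE: a count of NAs here may not fit

# An integer sum that exceeds the limit comes back as NA (with the
# "integer overflow" warning, suppressed here).
overflowed <- suppressWarnings(sum(c(int_max, 1L)))
is.na(overflowed)

# Using that NA as an if() condition raises an error instead of TRUE/FALSE.
res <- tryCatch(
  if (!overflowed) "nothing to replace",
  error = function(e) conditionMessage(e)
)
res
```

Working column by column, as in the approach that follows, never has to count all cells at once.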
In this application, you can get your desired outcome in two steps: find the positions of the NAs in each column, then use the set function to replace the values. I constructed a data.table as a reproducible example.
library(data.table)
set.seed(1234)
dt <- data.table(matrix(sample(c(NA, rnorm(4)), replace=TRUE, 50), 10))
This looks like
dt
V1 V2 V3 V4 V5
1: 1.0844412 NA -2.3456977 -2.3456977 -1.2070657
2: 0.2774292 -1.2070657 NA -2.3456977 1.0844412
3: 1.0844412 -1.2070657 0.2774292 0.2774292 NA
4: 0.2774292 -1.2070657 -1.2070657 1.0844412 -1.2070657
5: -1.2070657 NA -1.2070657 -1.2070657 1.0844412
6: -2.3456977 NA 0.2774292 1.0844412 0.2774292
7: -1.2070657 -1.2070657 NA -1.2070657 NA
8: -2.3456977 -2.3456977 1.0844412 0.2774292 0.2774292
9: -1.2070657 0.2774292 -1.2070657 1.0844412 0.2774292
10: -1.2070657 -2.3456977 -1.2070657 0.2774292 1.0844412
The first step is to find the NAs for each column.
myNAs <- lapply(dt, function(x) which(is.na(x)))
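To make the shape of that result concrete, here is a small sketch on a made-up table (toy and its column names are hypothetical, not part of the example above). Each list element holds the row positions of the NAs in one column, and a column without NAs yields an empty integer vector.

```r
library(data.table)

# Hypothetical three-column table: u has one NA, v has two, w has none.
toy <- data.table(u = c(1, NA, 3), v = c(NA, NA, 2), w = c(4, 5, 6))

# Same idea as above: row positions of the NAs, one list element per column.
na_idx <- lapply(toy, function(x) which(is.na(x)))
str(na_idx)
```

The empty vector for w is the case a length check can skip during replacement.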
Next, use a for loop to iterate over the columns and fill in the NA values with the super efficient set function, after checking with if that the column contains missing values.
for(j in seq_along(dt)) if(length(myNAs[[j]]) > 0) set(dt, myNAs[[j]], j, 0)
set performs the replacement "in place" (without any copies), so following this operation, the data.table dt has the former NAs replaced with 0s.
dt
V1 V2 V3 V4 V5
1: 1.0844412 0.0000000 -2.3456977 -2.3456977 -1.2070657
2: 0.2774292 -1.2070657 0.0000000 -2.3456977 1.0844412
3: 1.0844412 -1.2070657 0.2774292 0.2774292 0.0000000
4: 0.2774292 -1.2070657 -1.2070657 1.0844412 -1.2070657
5: -1.2070657 0.0000000 -1.2070657 -1.2070657 1.0844412
6: -2.3456977 0.0000000 0.2774292 1.0844412 0.2774292
7: -1.2070657 -1.2070657 0.0000000 -1.2070657 0.0000000
8: -2.3456977 -2.3456977 1.0844412 0.2774292 0.2774292
9: -1.2070657 0.2774292 -1.2070657 1.0844412 0.2774292
10: -1.2070657 -2.3456977 -1.2070657 0.2774292 1.0844412
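On the second question: as far as I know, rbindlist has no argument for choosing the fill value, so one option is to bind with fill = TRUE and then run the same set loop over the result. This is only a sketch on two made-up tables (a, b and combined are hypothetical names). Note that it also zeroes any NAs that were already present in the inputs, not just the padding.

```r
library(data.table)

# Two hypothetical tables with only partly overlapping columns.
a <- data.table(x = c(1, 2), y = c(3, 4))
b <- data.table(y = c(5, 6), z = c(7, 8))

# fill = TRUE pads the columns missing from each input with NA.
combined <- rbindlist(list(a, b), fill = TRUE)

# Replace the NAs in place, one column at a time, with set().
for (j in seq_along(combined)) {
  na_rows <- which(is.na(combined[[j]]))
  if (length(na_rows) > 0) set(combined, i = na_rows, j = j, value = 0)
}
combined
```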
Upvotes: 6