Reputation: 1163
I have a table with a lot of columns, and I want to remove the columns that have more than 500 missing values.
I already know the number of missing values per column from:
library(fields)
t(stats(mm))
I got :
N mean Std.Dev. min Q1 median Q3 max missing values
V1 1600 8.67 … 400
Some columns show NA for every statistic:
N mean Std.Dev. min Q1 median Q3 max missing values
V50 NA NA NA NA NA NA
I also want to remove this kind of column.
Upvotes: 7
Views: 14383
Reputation: 1
Here m is the matrix that you are working with. The code creates a vector, wntg (short for "which needs to go"), that lists the columns whose number of NA values exceeds a threshold (500 in your case).
The condition in this comparison can easily be modified to fit your needs.
It then builds a new matrix, mr (short for "m reduced"), from which the columns listed in wntg have been removed.
In this simple example I exclude columns with more than 2 NAs:
wntg<-which(colSums(is.na(m))>2)
mr<-m[,-c(wntg)]
> m<-matrix(c(1,2,3,4,NA,NA,7,8,9,NA,NA,NA), nrow=4, ncol =3)
> m
[,1] [,2] [,3]
[1,] 1 NA 9
[2,] 2 NA NA
[3,] 3 7 NA
[4,] 4 8 NA
> wntg<-which(colSums(is.na(m))>2)
> wntg
[1] 3
> mr<-m[,-c(wntg)]
> mr
[,1] [,2]
[1,] 1 NA
[2,] 2 NA
[3,] 3 7
[4,] 4 8
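One caveat with the negative-index form: if no column exceeds the threshold, wntg is empty and m[,-c(wntg)] returns a matrix with zero columns instead of leaving m unchanged. A minimal sketch of a logical-index variant that avoids this, reusing the example matrix above (the names keep and thresh are only illustrative, not part of the original answer):

```r
m <- matrix(c(1, 2, 3, 4, NA, NA, 7, 8, 9, NA, NA, NA), nrow = 4, ncol = 3)
thresh <- 2  # illustrative cutoff; the question would use 500

# TRUE for every column whose NA count is within the cutoff
keep <- colSums(is.na(m)) <= thresh

# Logical indexing keeps all columns when none exceeds the cutoff;
# drop = FALSE preserves the matrix shape even if a single column remains
mr <- m[, keep, drop = FALSE]
```

With the 4 x 3 example, keep marks the first two columns, so mr has the same two columns as in the session above.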
Upvotes: 0
Reputation: 846
Another potential solution (works especially well with data frames). Note that this drops every column containing at least one NA, regardless of how many:
data[,!sapply(data,function(x) any(is.na(x)))]
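To apply a count threshold rather than dropping on any NA, the same idea can be adapted by counting the missing values per column. This is only a sketch on a small made-up data frame; the names max.na, na.count and cleaned are hypothetical, and the cutoff of 2 stands in for the 500 from the question:

```r
# Small hypothetical data frame standing in for the real data
data <- data.frame(a = c(1, 2, 3, 4),
                   b = c(NA, NA, 7, 8),
                   c = c(9, NA, NA, NA))
max.na <- 2  # illustrative cutoff; the question would use 500

# Count NAs per column, then keep the columns at or below the cutoff
na.count <- vapply(data, function(x) sum(is.na(x)), integer(1))
cleaned <- data[, na.count <= max.na, drop = FALSE]
```

A column that is entirely NA has as many missing values as there are rows, so it is also removed whenever the number of rows exceeds the cutoff.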
Upvotes: 5
Reputation: 3287
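rem <- integer(0)  # indices of the columns to drop
for (col.nr in seq_len(ncol(data))) {
  col <- data[, col.nr]
  # drop columns with more than 500 NAs, or consisting only of NAs
  if (sum(is.na(col)) > 500 || all(is.na(col))) {
    rem <- c(rem, col.nr)
  }
}
# an empty negative index would select no columns, so guard against it
if (length(rem) > 0) data[, -rem, drop = FALSE] else data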
Upvotes: 1
Reputation: 11946
If you store the results of the stats call like this:
tmpres<-t(stats(mm))
You can do something like:
whichcolsneedtogo<-apply(tmpres, 1, function(currow){all(is.na(currow)) || isTRUE(currow[["missing values"]] > 500)})
Finally:
mmclean<-mm[, !whichcolsneedtogo, drop=FALSE]
Of course this is untested, as you have not provided data to reproduce your example.
Upvotes: 5