Delphine

Reputation: 1163

Removing columns with missing values

I have a table with a lot of columns and I want to remove the columns that have more than 500 missing values.

I already know the number of missing values per column from:

library(fields)
t(stats(mm))

I got:

  N     mean  Std.Dev.    min       Q1  median       Q3 max missing values
V1 1600 8.67  …                                               400

Some columns show NA for every statistic:

      N     mean  Std.Dev.    min       Q1  median       Q3 max missing values
 V50  NA    NA      NA         NA        NA                   NA

I also want to remove these kinds of columns.

Upvotes: 7

Views: 14383

Answers (5)

FlyDr

Reputation: 1

Here m is the matrix you are working with. The first line creates a vector, wntg (short for "which needs to go"), that lists the columns whose number of NA values is greater than a threshold (500 in your case).

The condition in this comparison can easily be modified to fit your needs.

The second line makes a new matrix, mr (short for "m reduced"), from which the columns listed in wntg have been removed.

In this simple example, columns with more than 2 NAs are excluded:

wntg<-which(colSums(is.na(m))>2)

mr<-m[,-c(wntg)]

> m<-matrix(c(1,2,3,4,NA,NA,7,8,9,NA,NA,NA), nrow=4, ncol =3)
> m
     [,1] [,2] [,3]
[1,]    1   NA    9
[2,]    2   NA   NA
[3,]    3    7   NA
[4,]    4    8   NA
> wntg<-which(colSums(is.na(m))>2)
> wntg
[1] 3
> mr<-m[,-c(wntg)]
> mr
     [,1] [,2]
[1,]    1   NA
[2,]    2   NA
[3,]    3    7
[4,]    4    8
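
One caveat with the negative-index form: if no column exceeds the threshold, which() returns an empty vector and m[, -c(wntg)] selects no columns at all instead of keeping all of them. A minimal sketch of the same idea with a logical mask, which avoids that case (the names keep and mr_all are mine, not part of the approach above):

```r
m <- matrix(c(1, 2, 3, 4, NA, NA, 7, 8, 9, NA, NA, NA), nrow = 4, ncol = 3)

# TRUE for columns that have at most 2 NAs, i.e. the columns to keep
keep <- colSums(is.na(m)) <= 2

# drop = FALSE keeps the result a matrix even if only one column survives
mr <- m[, keep, drop = FALSE]

# With a threshold that no column exceeds, the mask is all TRUE
# and every column is kept
mr_all <- m[, colSums(is.na(m)) <= 10, drop = FALSE]
```

Here mr holds the first two columns of m, and mr_all still has all three.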

Upvotes: 0

chandler

Reputation: 846

Another potential solution (works especially well with data frames); note that it removes every column that contains at least one NA:

data[,!sapply(data,function(x) any(is.na(x)))]
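
Because any(is.na(x)) is TRUE as soon as a column has a single NA, a threshold like the one in the question needs a count instead. A rough sketch along the same lines, assuming a data frame; drop_sparse_columns, max_na and the toy data are hypothetical, and the toy threshold is 1 instead of 500:

```r
# Hypothetical helper: keep the columns of a data frame with at most max_na NAs
drop_sparse_columns <- function(df, max_na) {
  # Count the NAs in each column; sum() of a logical vector is an integer
  na_counts <- vapply(df, function(x) sum(is.na(x)), integer(1))
  # An all-NA column is dropped too, even when the table has
  # fewer rows than max_na
  all_na <- na_counts == nrow(df)
  df[, na_counts <= max_na & !all_na, drop = FALSE]
}

toy <- data.frame(a = c(1, 2, 3, 4),
                  b = c(NA, NA, 7, 8),
                  c = c(NA, NA, NA, NA))
cleaned <- drop_sparse_columns(toy, max_na = 1)
```

With the toy data only column a survives: b has two NAs and c is entirely NA.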

Upvotes: 5

Ramnath

Reputation: 55695

Here is a one-liner to do it, keeping only the columns with at most 500 missing values:

mm[, colSums(is.na(mm)) <= 500]

Upvotes: 10

pvoosten

Reputation: 3287

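rem <- integer(0)  # indices of the columns to remove
for (col.nr in seq_len(ncol(data))) {
    col.na <- is.na(data[, col.nr])
    # remove columns with more than 500 NAs, or with nothing but NAs
    if (sum(col.na) > 500 || all(col.na)) {
        rem <- c(rem, col.nr)
    }
}
# data[, -rem] with an empty rem would select no columns, so guard that case
if (length(rem) > 0) data <- data[, -rem, drop = FALSE]
data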

Upvotes: 1

Nick Sabbe

Reputation: 11946

If you store the results of the stats call like this:

tmpres<-t(stats(mm))

You can do something like:

whichcolsneedtogo<-apply(tmpres, 1, function(currow){all(is.na(currow)) || (currow["missing values"] > 500)})

Finally:

mmclean<-mm[!whichcolsneedtogo]

Of course this is untested, as you have not provided data to reproduce your example.
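
For what it's worth, here is a small sketch that exercises the same apply() logic on a hand-built stand-in for t(stats(mm)), so it runs without the fields package. The matrix tmpres_toy and its values are invented for illustration; only the column name "missing values" is taken from the output shown in the question.

```r
# Invented stand-in for t(stats(mm)): one row per column of mm
tmpres_toy <- rbind(
  V1  = c(N = 1600, "missing values" = 400),
  V2  = c(N = 900,  "missing values" = 1100),
  V50 = c(N = NA,   "missing values" = NA)
)

# TRUE for rows that are entirely NA or report more than 500 missing values
togo <- apply(tmpres_toy, 1, function(currow) {
  all(is.na(currow)) || (currow["missing values"] > 500)
})

# Names of the columns that would be kept
kept <- names(togo)[!togo]
```

On this toy input V2 and V50 are flagged for removal and only V1 is kept.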

Upvotes: 5
