Moira

Reputation: 255

Show columns with NAs in a data.frame

I'd like to show the names of columns in a large dataframe that contain missing values. Basically, I want the equivalent of complete.cases(df) but for columns, not rows. Some of the columns are non-numeric, so something like

names(df[is.na(colMeans(df))])

returns "Error in colMeans(df) : 'x' must be numeric." So, my current solution is to transpose the dataframe and run complete.cases, but I'm guessing there's some variant of apply (or something in plyr) that's much more efficient.

nacols <- function(df) {
  names(df[,!complete.cases(t(df))])
} 

w <- c("hello","goodbye","stuff")
x <- c(1,2,3)
y <- c(1,NA,0)
z <- c(1,0, NA)
tmp <- data.frame(w,x,y,z)

nacols(tmp)
[1] "y" "z"

Can someone show me a more efficient function to identify columns that have NAs?

Upvotes: 20

Views: 18426

Answers (3)

Tyler Rinker

Reputation: 109874

This is the fastest way that I know of:

unlist(lapply(df, function(x) any(is.na(x))))

EDIT:

Since everyone else wrote out a complete function, here is mine in full:

nacols <- function(df) {
    colnames(df)[unlist(lapply(df, function(x) any(is.na(x))))]
}

And here is a microbenchmark of the four solutions on a Windows 7 machine:

Unit: microseconds
    expr     min      lq  median      uq        max
1 ANDRIE  85.380  91.911 106.375 116.639    863.124
2 MANOEL  87.712  93.778 105.908 118.971   8426.886
3  MOIRA 764.215 798.273 817.402 876.188 143039.632
4  TYLER  51.321  57.853  62.518  72.316   1365.136

And here's a visual of that: [image: plot of the benchmark timings above]

Edit: At the time I wrote this, anyNA either did not exist or I was unaware of it. It may speed things up further. Per the help page ?anyNA:

The generic function anyNA implements any(is.na(x)) in a possibly faster way (especially for atomic vectors).

nacols <- function(df) {
    colnames(df)[unlist(lapply(df, anyNA))]
}
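
For anyone who wants to try the anyNA version on the question's data, here is a minimal sketch. It swaps unlist(lapply(...)) for vapply, which is my substitution and not part of the answer above; the name nacols_vapply is made up for illustration.

```r
# Hypothetical variant of nacols(); the name nacols_vapply is illustrative only.
nacols_vapply <- function(df) {
  # vapply() checks that every column yields exactly one logical value
  has_na <- vapply(df, anyNA, logical(1))
  names(df)[has_na]
}

# The example data from the question
tmp <- data.frame(
  w = c("hello", "goodbye", "stuff"),
  x = c(1, 2, 3),
  y = c(1, NA, 0),
  z = c(1, 0, NA)
)

nacols_vapply(tmp)
# [1] "y" "z"
```

The only practical difference from the lapply version should be the stricter type check on the per-column result; I have not benchmarked it against the timings above.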

Upvotes: 28

Andrie

Reputation: 179428

One way...

nacols <- function(x){
  y <- sapply(x, function(xx)any(is.na(xx)))
  names(y[y])
}  

nacols(tmp)
[1] "y" "z"

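Explanation: sapply returns y as a named logical vector with one element per column. Indexing it by itself, y[y], keeps only the elements that are TRUE, so names(y[y]) returns the names of the columns that contain at least one NA.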
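
To see the indexing step in isolation, here is a small sketch that builds the logical vector by hand (the values are what sapply would give for tmp; writing them out literally is my shortcut, not part of the answer):

```r
# A named logical vector like the one sapply() returns for tmp
y <- c(w = FALSE, x = FALSE, y = TRUE, z = TRUE)

y[y]             # keeps only the TRUE elements, with their names
names(y[y])      # [1] "y" "z"
names(which(y))  # an equivalent spelling of the same lookup
```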

Upvotes: 6

Manoel Galdino

Reputation: 2396

Here is one way:

colnames(tmp)[colSums(is.na(tmp)) > 0]
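
Broken into steps, the one-liner looks roughly like this. The intermediate names na_mask and na_counts are mine, chosen only for illustration:

```r
# The example data from the question
tmp <- data.frame(
  w = c("hello", "goodbye", "stuff"),
  x = c(1, 2, 3),
  y = c(1, NA, 0),
  z = c(1, 0, NA)
)

na_mask   <- is.na(tmp)        # logical matrix, same dimensions as tmp
na_counts <- colSums(na_mask)  # number of NAs in each column

na_counts
# w x y z
# 0 0 1 1

colnames(tmp)[na_counts > 0]
# [1] "y" "z"
```

Since is.na() on a data frame builds a full logical matrix before summing, this may use more memory than a column-by-column check on very large data, though it also gives you the NA counts for free.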

Hope it helps,

Manoel

Upvotes: 6
