Reputation: 255
I'd like to show the names of columns in a large dataframe that contain missing values. Basically, I want the equivalent of complete.cases(df) but for columns, not rows. Some of the columns are non-numeric, so something like
names(df[is.na(colMeans(df))])
returns "Error in colMeans(df) : 'x' must be numeric." So, my current solution is to transpose the dataframe and run complete.cases, but I'm guessing there's some variant of apply (or something in plyr) that's much more efficient.
nacols <- function(df) {
names(df[,!complete.cases(t(df))])
}
w <- c("hello","goodbye","stuff")
x <- c(1,2,3)
y <- c(1,NA,0)
z <- c(1,0, NA)
tmp <- data.frame(w,x,y,z)
nacols(tmp)
[1] "y" "z"
Can someone show me a more efficient function to identify columns that have NAs?
Upvotes: 20
Views: 18426
Reputation: 109874
This is the fastest way that I know of:
unlist(lapply(df, function(x) any(is.na(x))))
EDIT:
I guess everyone else wrote theirs out as a complete function, so here is mine in full:
nacols <- function(df) {
colnames(df)[unlist(lapply(df, function(x) any(is.na(x))))]
}
And if you microbenchmark the four solutions on a Windows 7 machine:
Unit: microseconds
    expr     min      lq  median      uq        max
1 ANDRIE  85.380  91.911 106.375 116.639    863.124
2 MANOEL  87.712  93.778 105.908 118.971   8426.886
3  MOIRA 764.215 798.273 817.402 876.188 143039.632
4  TYLER  51.321  57.853  62.518  72.316   1365.136
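For anyone who wants to rerun a comparison like this, here is a rough sketch. The mapping of the labels to code is an assumption based on the names on the answers in this thread (MOIRA is assumed to be the transpose approach from the question), and the function names andrie, manoel, moira and tyler are made up for illustration. The microbenchmark package is only used if it is installed, and timings will differ by machine and R version.

```r
# Hypothetical reconstruction of the four candidates. The label-to-code
# mapping is an assumption, not something stated in the thread.
andrie <- function(df) {
  y <- sapply(df, function(x) any(is.na(x)))
  names(y[y])
}
manoel <- function(df) colnames(df)[colSums(is.na(df)) > 0]
# drop = FALSE is added here so a single NA column still yields a data frame.
moira <- function(df) names(df[, !complete.cases(t(df)), drop = FALSE])
tyler <- function(df) colnames(df)[unlist(lapply(df, function(x) any(is.na(x))))]

# Same example data as in the question.
tmp <- data.frame(
  w = c("hello", "goodbye", "stuff"),
  x = c(1, 2, 3),
  y = c(1, NA, 0),
  z = c(1, 0, NA)
)

# Check that all four agree before timing them.
results <- list(andrie(tmp), manoel(tmp), moira(tmp), tyler(tmp))
stopifnot(all(vapply(results, identical, logical(1), c("y", "z"))))

# Time them only if the microbenchmark package is available.
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(
    ANDRIE = andrie(tmp),
    MANOEL = manoel(tmp),
    MOIRA  = moira(tmp),
    TYLER  = tyler(tmp),
    times = 1000L
  ))
}
```

On a four-column toy data frame the differences are tiny in absolute terms, so it is probably worth repeating this on data closer to the real size before drawing conclusions.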
And here's a visual of that: [figure: plot of the microbenchmark timings for the four solutions]
Edit: At the time I wrote this, anyNA did not exist or I was unaware of it. It may speed things up further. Per the help page ?anyNA:
The generic function anyNA implements any(is.na(x)) in a possibly faster way (especially for atomic vectors).
nacols <- function(df) {
colnames(df)[unlist(lapply(df, function(x) anyNA(x)))]
}
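A possible variant, in case it is useful: vapply can stand in for unlist(lapply(...)) and states up front that each column yields exactly one logical value. This is only a sketch; the name nacols_vapply is made up here, and anyNA needs R 3.1.0 or later.

```r
# Type-stable variant: vapply() checks that every column returns a
# single logical value, so no unlist() is needed.
nacols_vapply <- function(df) {
  names(df)[vapply(df, anyNA, logical(1))]
}

# Same example data as in the question.
tmp <- data.frame(
  w = c("hello", "goodbye", "stuff"),
  x = c(1, 2, 3),
  y = c(1, NA, 0),
  z = c(1, 0, NA)
)

nacols_vapply(tmp)                 # "y" "z"
nacols_vapply(tmp[, c("w", "x")])  # character(0): no columns contain NA
```

Whether this is measurably faster than the lapply version probably depends on the data; the main gain is that a column returning something unexpected fails loudly instead of silently changing the result's shape.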
Upvotes: 28
Reputation: 179428
One way...
nacols <- function(x){
y <- sapply(x, function(xx)any(is.na(xx)))
names(y[y])
}
nacols(tmp)
[1] "y" "z"
Explanation: since the result y is a logical vector, names(y[y]) returns the names of y for only those cases where y is TRUE.
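To see the indexing step on its own, here is a small sketch with a hand-built logical vector that matches what sapply would return for tmp:

```r
# A named logical vector, one element per column of tmp.
y <- c(w = FALSE, x = FALSE, y = TRUE, z = TRUE)

# Indexing a logical vector by itself keeps only the TRUE elements,
# and the names travel with them.
kept <- y[y]

names(kept)  # "y" "z"
```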
Upvotes: 6
Reputation: 2396
Here is one way:
colnames(tmp)[colSums(is.na(tmp)) > 0]
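Broken into steps, the one-liner above works roughly like this. The intermediate names na_matrix and na_counts are just for illustration:

```r
# Same example data as in the question.
tmp <- data.frame(
  w = c("hello", "goodbye", "stuff"),
  x = c(1, 2, 3),
  y = c(1, NA, 0),
  z = c(1, 0, NA)
)

# is.na() on a data frame returns a logical matrix of the same shape,
# so it works even when some columns are non-numeric.
na_matrix <- is.na(tmp)

# colSums() counts the TRUE values, i.e. the number of NAs per column.
na_counts <- colSums(na_matrix)

# Keep the names of the columns with at least one NA.
colnames(tmp)[na_counts > 0]  # "y" "z"
```

A side benefit is that na_counts also tells you how many values are missing in each column. The cost is that is.na builds a full logical matrix for the whole data frame, which may matter on very large data.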
Hope it helps,
Manoel
Upvotes: 6