Hariharan S
Hariharan S

Reputation: 165

I want to identify columns names that are filled with zeroes and NAs in R

I have a data frame with about 111 columns, few of the columns are entirely filled with zeros and NAs.I want to programmatically identify the column names that have only zeroes and NAs,

I used the below to find columns that entirely have NAs

a<-data[,((colSums(is.na(data)) == nrow(data))==T)]
names(a)

Now I want to find column names that are filled with a mixture of Zeroes and NAs ,how can I do that ?

Upvotes: 1

Views: 669

Answers (4)

MKR
MKR

Reputation: 20095

One can try using dplyr::select_if to get the names of columns having only 0 or NA.

library(dplyr)
df %>% select_if(funs(all(is.na(na_if(.,0))))) %>% names()
#[1] "a" "d" "f"

Data: Sample data taken from @MauritsEvers

Upvotes: 1

Hugh
Hugh

Reputation: 16099

Using setequal as the thelatemail did is the clearest, as well as distinguishing between NaN and NAs. If you really need it to be fast though, you may want to consider:

only_zeroes_and_NAs <- function(v) {
  is.numeric(v) &&
    anyNA(v) &&
    suppressWarnings({
      # min(v, na.rm = TRUE) will emit a 
      # warning if all are NA, which we can ignore
      min(v, na.rm = TRUE) == 0 &&
        max(v, na.rm = TRUE) == 0
    })
}

only_zeroes_and_NAs_or_NaN <- function(v) {
  is.numeric(v) &&
    anyNA(v) &&
    suppressWarnings({
      # min(v, na.rm = TRUE) will emit a 
      # warning if all are NA, which we can ignore
      min(v, na.rm = TRUE) == 0 &&
        max(v, na.rm = TRUE) == 0
    }) &&
    !any(is.nan(v))
}

set.seed(6)
N = 10e6
dat <- data.frame(a=sample(1:4, size = N, replace = TRUE),
                  b=sample(c(0,NA,1,NA), size = N, replace = TRUE),
                  c=sample(c(0,NA,0,NA), size = N, replace = TRUE),
                  d=sample(c(0,NaN,0,NaN), size = N, replace = TRUE),
                  e=NA, f=0, g = NA_real_)

bench::mark(me = names(dat)[sapply(dat, only_zeroes_and_NAs)],
            me_nan = names(dat)[sapply(dat, only_zeroes_and_NAs_or_NaN)],
            setequal = names(dat)[sapply(dat, setequal, c(0, NA))], 
            check = FALSE,
            filter_gc = FALSE)
#> # A tibble: 3 x 10
#>   expression      min     mean   median      max `itr/sec` mem_alloc  n_gc
#>   <chr>      <bch:tm> <bch:tm> <bch:tm> <bch:tm>     <dbl> <bch:byt> <dbl>
#> 1 me          331.3ms  331.5ms  331.5ms  331.7ms     3.02   140.16KB     0
#> 2 me_nan      453.2ms  467.5ms  467.5ms  481.8ms     2.14    76.34MB     1
#> 3 setequal       2.1s     2.1s     2.1s     2.1s     0.477    1.69GB     6
#> # ... with 2 more variables: n_itr <int>, total_time <bch:tm>

Created on 2018-06-28 by the reprex package (v0.2.0).

Upvotes: 1

thelatemail
thelatemail

Reputation: 93908

This is a job for setequal:

dat <- data.frame(a=1:4, b=c(0,NA,1,NA), c=c(0,NA,0,NA), d=c(0,NaN,0,NaN), e=NA, f=0)
names(dat)[sapply(dat, setequal, c(0,NA))]
#[1] "c"

setequal essentially compares the unique values of two vectors to make sure they are totally overlapping. See here: What is the difference between setequal(a,b) and identical(a,b) in R? for more info.

Upvotes: 7

Maurits Evers
Maurits Evers

Reputation: 50718

As you don't provide a minimal representative dataset, we first generate some sample data

set.seed(2017);
df <- setNames(data.frame(
    rep(0, 10),
    runif(10),
    runif(10),
    rep(NA, 10),
    runif(10),
    sample(c(NA, 0), 10, replace = T)), letters[1:6])
df;
#   a          b           c  d          e  f
#1  0 0.92424261 0.674331481 NA 0.63411352  0
#2  0 0.53717641 0.002020766 NA 0.37986744 NA
#3  0 0.46919565 0.025093514 NA 0.94207403 NA
#4  0 0.28862618 0.432077786 NA 0.75499369 NA
#5  0 0.77008816 0.499391912 NA 0.22761184  0
#6  0 0.77276871 0.388681932 NA 0.91466603  0
#7  0 0.03932234 0.395375316 NA 0.62044504 NA
#8  0 0.43490560 0.715707325 NA 0.31910458 NA
#9  0 0.47216639 0.940999879 NA 0.07628881 NA
#10 0 0.27383312 0.827229161 NA 0.26083932  0

Then we can use all with is.na and x == 0 to identify null/NA columns

names(df)[sapply(df, function(x) all(is.na(x) | x == 0))]
#[1] "a" "d" "f"

Upvotes: 4

Related Questions