Reputation: 165
I have a data frame with about 111 columns, few of the columns are entirely filled with zeros and NAs.I want to programmatically identify the column names that have only zeroes and NAs,
I used the below to find columns that entirely have NAs
a<-data[,((colSums(is.na(data)) == nrow(data))==T)]
names(a)
Now I want to find column names that are filled with a mixture of Zeroes and NAs ,how can I do that ?
Upvotes: 1
Views: 669
Reputation: 20095
One can try using dplyr::select_if
to get the names of columns having only 0
or NA
.
library(dplyr)
df %>% select_if(funs(all(is.na(na_if(.,0))))) %>% names()
#[1] "a" "d" "f"
Data: Sample data taken from @MauritsEvers
Upvotes: 1
Reputation: 16099
Using setequal
as the thelatemail did is the clearest, as well as distinguishing between NaN
and NA
s. If you really need it to be fast though, you may want to consider:
only_zeroes_and_NAs <- function(v) {
is.numeric(v) &&
anyNA(v) &&
suppressWarnings({
# min(v, na.rm = TRUE) will emit a
# warning if all are NA, which we can ignore
min(v, na.rm = TRUE) == 0 &&
max(v, na.rm = TRUE) == 0
})
}
only_zeroes_and_NAs_or_NaN <- function(v) {
is.numeric(v) &&
anyNA(v) &&
suppressWarnings({
# min(v, na.rm = TRUE) will emit a
# warning if all are NA, which we can ignore
min(v, na.rm = TRUE) == 0 &&
max(v, na.rm = TRUE) == 0
}) &&
!any(is.nan(v))
}
set.seed(6)
N = 10e6
dat <- data.frame(a=sample(1:4, size = N, replace = TRUE),
b=sample(c(0,NA,1,NA), size = N, replace = TRUE),
c=sample(c(0,NA,0,NA), size = N, replace = TRUE),
d=sample(c(0,NaN,0,NaN), size = N, replace = TRUE),
e=NA, f=0, g = NA_real_)
bench::mark(me = names(dat)[sapply(dat, only_zeroes_and_NAs)],
me_nan = names(dat)[sapply(dat, only_zeroes_and_NAs_or_NaN)],
setequal = names(dat)[sapply(dat, setequal, c(0, NA))],
check = FALSE,
filter_gc = FALSE)
#> # A tibble: 3 x 10
#> expression min mean median max `itr/sec` mem_alloc n_gc
#> <chr> <bch:tm> <bch:tm> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 me 331.3ms 331.5ms 331.5ms 331.7ms 3.02 140.16KB 0
#> 2 me_nan 453.2ms 467.5ms 467.5ms 481.8ms 2.14 76.34MB 1
#> 3 setequal 2.1s 2.1s 2.1s 2.1s 0.477 1.69GB 6
#> # ... with 2 more variables: n_itr <int>, total_time <bch:tm>
Created on 2018-06-28 by the reprex package (v0.2.0).
Upvotes: 1
Reputation: 93908
This is a job for setequal
:
dat <- data.frame(a=1:4, b=c(0,NA,1,NA), c=c(0,NA,0,NA), d=c(0,NaN,0,NaN), e=NA, f=0)
names(dat)[sapply(dat, setequal, c(0,NA))]
#[1] "c"
setequal
essentially compares the unique values of two vectors to make sure they are totally overlapping. See here: What is the difference between setequal(a,b) and identical(a,b) in R? for more info.
Upvotes: 7
Reputation: 50718
As you don't provide a minimal representative dataset, we first generate some sample data
set.seed(2017);
df <- setNames(data.frame(
rep(0, 10),
runif(10),
runif(10),
rep(NA, 10),
runif(10),
sample(c(NA, 0), 10, replace = T)), letters[1:6])
df;
# a b c d e f
#1 0 0.92424261 0.674331481 NA 0.63411352 0
#2 0 0.53717641 0.002020766 NA 0.37986744 NA
#3 0 0.46919565 0.025093514 NA 0.94207403 NA
#4 0 0.28862618 0.432077786 NA 0.75499369 NA
#5 0 0.77008816 0.499391912 NA 0.22761184 0
#6 0 0.77276871 0.388681932 NA 0.91466603 0
#7 0 0.03932234 0.395375316 NA 0.62044504 NA
#8 0 0.43490560 0.715707325 NA 0.31910458 NA
#9 0 0.47216639 0.940999879 NA 0.07628881 NA
#10 0 0.27383312 0.827229161 NA 0.26083932 0
Then we can use all
with is.na
and x == 0
to identify null/NA
columns
names(df)[sapply(df, function(x) all(is.na(x) | x == 0))]
#[1] "a" "d" "f"
Upvotes: 4