Reputation: 21
I am working with a data frame that produces two output columns. One column always has more NA values than the other column, but not in any predictable fashion. here is my question, how can I use dplyr to select the column with the fewest number of NA values. I was thinking of utilizing which.min to decide, but not sure how to put it all together. Note that both columns contain na values, and I want to select the one with the fewest of those values.
Upvotes: 2
Views: 1152
Reputation: 667
library(dplyr)
df <- tibble(a = c(rep(c(NA, 1:5), 4)), # df with different NA counts/col
b = c(rep(c(NA, NA, 2:5), 4)))
df %>%
summarise_all(funs(sum(is.na(.)))) # NA counts
#> # A tibble: 1 x 2
#> a b
#> <int> <int>
#> 1 4 8
df %>% # answer
select_if(funs(which.min(sum(is.na(.)))))
#> # A tibble: 24 x 1
#> a
#> <int>
#> 1 NA
#> 2 1
#> 3 2
#> 4 3
#> 5 4
#> 6 5
#> 7 NA
#> 8 1
#> 9 2
#> 10 3
#> # ... with 14 more rows
Created on 2018-05-25 by the reprex package (v0.2.0).
Upvotes: 0
Reputation: 23608
You can do this with dplyr
and purrr
.
inside which.min
you first calculate the number of NA's in the columns with map
(can be as many columns as you have in your data.frame. The keep
part returns only those columns which actually have NA's. The which.min
returns the named vector of which we take the name and supply it to the select
function of dplyr
.
I have outlined the code a bit so you can easily see which parts belong where.
library(purrr)
library(dplyr)
df %>% select(names(which.min(df %>%
map(function(x) sum(is.na(x))) %>%
keep(~ .x > 0)
)
)
)
Upvotes: 1