Reputation: 21

Select column that has the fewest NA values

I am working with a data frame that produces two output columns. One column always has more NA values than the other column, but not in any predictable fashion. here is my question, how can I use dplyr to select the column with the fewest number of NA values. I was thinking of utilizing which.min to decide, but not sure how to put it all together. Note that both columns contain na values, and I want to select the one with the fewest of those values.

Upvotes: 2

Answers (2)

knapply

Reputation: 667

library(dplyr)

df <- tibble(a = c(rep(c(NA, 1:5), 4)),      # df with different NA counts/col
             b = c(rep(c(NA, NA, 2:5), 4)))



df %>% 
  summarise_all(funs(sum(is.na(.))))         # NA counts
#> # A tibble: 1 x 2
#>       a     b
#>   <int> <int>
#> 1     4     8

df %>%                                       # answer
  select_if(funs(which.min(sum(is.na(.)))))
#> # A tibble: 24 x 1
#>        a
#>    <int>
#>  1    NA
#>  2     1
#>  3     2
#>  4     3
#>  5     4
#>  6     5
#>  7    NA
#>  8     1
#>  9     2
#> 10     3
#> # ... with 14 more rows

Created on 2018-05-25 by the reprex package (v0.2.0).

Upvotes: 0

phiver

Reputation: 23608

You can do this with dplyr and purrr.

inside which.min you first calculate the number of NA's in the columns with map (can be as many columns as you have in your data.frame. The keep part returns only those columns which actually have NA's. The which.min returns the named vector of which we take the name and supply it to the select function of dplyr.

I have outlined the code a bit so you can easily see which parts belong where.

library(purrr)
library(dplyr)


df %>% select(names(which.min(df %>% 
                                map(function(x) sum(is.na(x))) %>% 
                                keep(~ .x > 0)
                             )
                    )
              )

Upvotes: 1

Select column that has the fewest NA values

Answers (2)

Related Questions