Kevin Zarca

Reputation: 2732

Equivalent of apply() by row in the tidyverse?

I want to add a new column to a data.frame whose value is TRUE when the row contains at least one missing value and FALSE otherwise.

For that problem, apply is a perfect use case:

EDIT - added example

tab <- data.frame(a = 1:10, b = c(NA, letters[2:10]), c = c(LETTERS[1:9], NA))

tab$missing <- apply(tab, 1, function(x) any(is.na(x)))

However, I loaded the strict package and got this error: apply() coerces X to a matrix so is dangerous to use with data frames. Please use lapply() instead.

I know that I can safely ignore this error, but I was wondering whether there is a simple way to write this using one of the tidyverse packages. I tried with dplyr, without success:

tab %>% 
  rowwise() %>% 
  mutate(missing = any(is.na(.), na.rm = TRUE))

Upvotes: 11

Views: 9468

Answers (3)

wint3rschlaefer

Reputation: 309

You can use the complete.cases function:

tab %>% mutate(missing = !complete.cases(.))

To remove rows with one or more NAs, use:

tab %>% filter(complete.cases(.))
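
As a minimal base R sketch (no packages assumed, using the example data from the question), this shows what complete.cases returns and how negating it flags the rows that contain at least one NA:

```r
# Example data from the question
tab <- data.frame(a = 1:10,
                  b = c(NA, letters[2:10]),
                  c = c(LETTERS[1:9], NA),
                  stringsAsFactors = FALSE)

# complete.cases() is TRUE for rows with no NA in any column
cc <- complete.cases(tab)

# Negate it to flag rows with at least one missing value
tab$missing <- !cc

# Indices of the flagged rows
flagged <- which(tab$missing)
print(flagged)
```

For this data, only the first and last rows are flagged.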

Upvotes: 1

Rory Shaw

Reputation: 851

This works for the example data:

library(tidyverse)

tab <- tibble(a = 1:10, 
              b = c(NA, letters[2:10]), 
              c = c(LETTERS[1:9], NA))

tab_1 <- tab %>% mutate(missing = is.na(b) | is.na(c))

> tab_1
    a    b    c missing
1   1 <NA>    A    TRUE
2   2    b    B   FALSE
3   3    c    C   FALSE
4   4    d    D   FALSE
5   5    e    E   FALSE
6   6    f    F   FALSE
7   7    g    G   FALSE
8   8    h    H   FALSE
9   9    i    I   FALSE
10 10    j <NA>    TRUE

Upvotes: 1

alistaire

Reputation: 43334

If you want to avoid coercing to a matrix, you can use purrr::pmap, which iterates across the elements of a list in parallel and passes them to a function:

library(tidyverse)

tab <- tibble(a = 1:10, 
              b = c(NA, letters[2:10]), 
              c = c(LETTERS[1:9], NA))

tab %>% mutate(missing = pmap_lgl(., ~any(is.na(c(...)))))
#> # A tibble: 10 x 4
#>        a     b     c missing
#>    <int> <chr> <chr>   <lgl>
#>  1     1  <NA>     A    TRUE
#>  2     2     b     B   FALSE
#>  3     3     c     C   FALSE
#>  4     4     d     D   FALSE
#>  5     5     e     E   FALSE
#>  6     6     f     F   FALSE
#>  7     7     g     G   FALSE
#>  8     8     h     H   FALSE
#>  9     9     i     I   FALSE
#> 10    10     j  <NA>    TRUE

In the function, c() is needed to collect all the arguments passed in through ... into a single vector, which can then be passed to is.na and collapsed with any. Because the columns have different types, c() coerces the values to a common type (character here), but NA stays NA, so the test is unaffected. The *_lgl-suffixed version of pmap simplifies the result to a logical vector.

Note that while this avoids coercing to a matrix, it will not necessarily be faster than approaches that do, as matrix operations are highly optimized in R. It may make more sense to coerce to a matrix explicitly, e.g.
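
To make the mechanics concrete, here is a small sketch of what the function receives for each row. The three-row data frame `small` is a hypothetical cut-down version of the example, and an explicit function(...) is used in place of the ~ shorthand for readability:

```r
library(purrr)

# Hypothetical three-row version of the example data
small <- data.frame(a = 1:3,
                    b = c(NA, "b", "c"),
                    c = c("A", "B", NA),
                    stringsAsFactors = FALSE)

# pmap() calls the function once per row, passing one value from each
# column as a separate argument; c(...) gathers them into one vector
rows <- pmap(small, function(...) c(...))
print(rows[[1]])

# The same idea, collapsed to a single logical value per row
flags <- pmap_lgl(small, function(...) any(is.na(c(...))))
print(flags)
```

Each element of rows is one row flattened into a single vector, so is.na sees every cell of that row at once.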

tab %>% mutate(missing = rowSums(is.na(as.matrix(.))) > 0)

which returns the same thing.
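
As a quick sanity check, the following base R sketch compares the matrix-based result with the original apply() version and with complete.cases() on the example data. The variable names are only illustrative:

```r
# Example data from the question
tab <- data.frame(a = 1:10,
                  b = c(NA, letters[2:10]),
                  c = c(LETTERS[1:9], NA),
                  stringsAsFactors = FALSE)

# Original approach from the question (implicit coercion to a matrix)
via_apply <- unname(apply(tab, 1, function(x) any(is.na(x))))

# Explicit coercion to a matrix, then count NAs per row
via_rowsums <- unname(rowSums(is.na(as.matrix(tab))) > 0)

# Base R helper that avoids row-wise iteration entirely
via_cc <- !complete.cases(tab)

# All three should agree
all_agree <- identical(via_apply, via_rowsums) && identical(via_rowsums, via_cc)
print(all_agree)
```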

Upvotes: 9
