msoftrain
Reputation: 1037

How to summarise across different types of variables with dplyr::c_across()

I have data with different types of variables. Some are character, some factors, and some numeric, like below:

df <- data.frame(a = c("tt", "ss", "ss", NA), b=c(2,3,NA,1), c=c(1,2,NA, NA), d=c("tt", "ss", "ss", NA))

I'm trying to count the number of missing values per observation using c_across in dplyr. However, c_across doesn't seem to be able to combine different types of values, as the error message below suggests:

df %>%
  rowwise() %>%
  summarise(NAs = sum(is.na(c_across())))

Error: Problem with summarise() input NAs.
x Can't combine a <factor> and b <double>.
ℹ Input NAs is sum(is.na(c_across())).
ℹ The error occurred in row 1.

Indeed, if I include only numeric variables, it works.

df %>%
  rowwise() %>%
  summarise(NAs = sum(is.na(c_across(b:c))))

The same is true if I include only character variables:

df %>%
  rowwise() %>%
  summarise(NAs = sum(is.na(c_across(c(a,d)))))

I could solve the issue without using c_across like below, but I have lots of variables, so it's not very practical.

df %>%
  rowwise() %>%
  summarise(NAs = is.na(a)+is.na(b)+is.na(c)+is.na(d))

I could use the traditional apply approach, like below, but I'd like to solve this using dplyr.

apply(df, 1, function(x)sum(is.na(x)))

Any suggestions as to how to compute the number of missing values, row-wise, efficiently, and using dplyr?

Upvotes: 4

Views: 696

Answers (2)

akrun
Reputation: 887981

A much faster option is to skip rowwise() and c_across() altogether and use rowSums() instead:

library(dplyr)
df %>% 
     mutate(NAs = rowSums(is.na(.)))
#     a  b  c    d NAs
#1   tt  2  1   tt   0
#2   ss  3  2   ss   0
#3   ss NA NA   ss   2
#4 <NA>  1 NA <NA>   3

If we want to restrict the count to certain columns, e.g. the numeric ones:

df %>%
   mutate(NAs = rowSums(is.na(select(., where(is.numeric)))))
#     a  b  c    d NAs
#1   tt  2  1   tt   0
#2   ss  3  2   ss   0
#3   ss NA NA   ss   2
#4 <NA>  1 NA <NA>   1
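
The same count can also be written without the magrittr `.` placeholder. This is a minimal sketch, assuming dplyr >= 1.1.0 (where pick() is available) and the df from the question; in older versions across(everything()) plays a similar role:

```r
library(dplyr)

# Example data from the question
df <- data.frame(a = c("tt", "ss", "ss", NA), b = c(2, 3, NA, 1),
                 c = c(1, 2, NA, NA), d = c("tt", "ss", "ss", NA))

# pick(everything()) returns the current columns as a data frame,
# is.na() turns that into a logical matrix, and rowSums() counts
# the TRUE values in each row
res <- df %>%
  mutate(NAs = rowSums(is.na(pick(everything()))))

# Only the numeric columns
res_num <- df %>%
  mutate(NAs = rowSums(is.na(pick(where(is.numeric)))))
```

Because is.na() works column by column, no type conversion is needed, so the original column types are kept.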

Upvotes: 1

Duck
Reputation: 39623

I would suggest this approach. The issue comes from two things. First, the variables in your data frame have different types, and c_across() cannot combine them. Second, it helps to have a key variable for the row-wise task. So, in the code below we first convert the variables to a common type, then create an id from the row number. We pass that id to rowwise() and can then use c_across(). Here is the code (I have used your df data):

library(tidyverse)
#Code
df %>% 
  # funs() is deprecated since dplyr 0.8.0, so pass the function directly
  mutate_at(vars(everything()), as.character) %>%
  mutate(id=1:n()) %>%
  rowwise(id) %>%
  mutate(NAs = sum(is.na(c_across(a:d))))

Output:

# A tibble: 4 x 6
# Rowwise:  id
  a     b     c     d        id   NAs
  <chr> <chr> <chr> <chr> <int> <int>
1 tt    2     1     tt        1     0
2 ss    3     2     ss        2     0
3 ss    NA    NA    ss        3     2
4 NA    1     NA    NA        4     3

We can also avoid mutate_at() by using the newer across() with mutate() to convert the variables to a common type:

#Code 2
df %>% 
  mutate(across(a:d,~as.character(.))) %>%
  mutate(id=1:n()) %>%
  rowwise(id) %>%
  mutate(NAs = sum(is.na(c_across(a:d))))

Output:

# A tibble: 4 x 6
# Rowwise:  id
  a     b     c     d        id   NAs
  <chr> <chr> <chr> <chr> <int> <int>
1 tt    2     1     tt        1     0
2 ss    3     2     ss        2     0
3 ss    NA    NA    ss        3     2
4 NA    1     NA    NA        4     3
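
As a side note, rowwise() also works without an id column. The following is a minimal sketch, assuming dplyr >= 1.0.0 and the df from the question: it converts every column to character so that c_across() can combine the values, then counts the missing values per row:

```r
library(dplyr)

# Example data from the question
df <- data.frame(a = c("tt", "ss", "ss", NA), b = c(2, 3, NA, 1),
                 c = c(1, 2, NA, NA), d = c("tt", "ss", "ss", NA))

res <- df %>%
  # Bring all columns to one type; NA stays NA after as.character()
  mutate(across(everything(), as.character)) %>%
  rowwise() %>%
  # c_across() now combines four character values per row
  mutate(NAs = sum(is.na(c_across(everything())))) %>%
  # Drop the row-wise grouping once the count is computed
  ungroup()
```

Note that the columns in the result are all character, so this variant fits best when the original types are not needed afterwards.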

Upvotes: 2
