Reputation: 1037
I have data with different types of variables. Some are character, some factors, and some numeric, like below:
df <- data.frame(a = c("tt", "ss", "ss", NA), b=c(2,3,NA,1), c=c(1,2,NA, NA), d=c("tt", "ss", "ss", NA))
I'm trying to count the number of missing values per observation using c_across in dplyr. However, c_across doesn't seem to be able to combine different types of values, as the error message below suggests:
df %>%
rowwise() %>%
summarise(NAs = sum(is.na(c_across())))
Error: Problem with summarise() input NAs.
x Can't combine a <factor> and b <double>.
ℹ Input NAs is sum(is.na(c_across())).
ℹ The error occurred in row 1.
Indeed, if I include only numeric variables, it works.
df %>%
rowwise() %>%
summarise(NAs = sum(is.na(c_across(b:c))))
Same thing if I include only character variables
df %>%
rowwise() %>%
summarise(NAs = sum(is.na(c_across(c(a,d)))))
I could solve the issue without using c_across, like below, but I have lots of variables, so it's not very practical.
df %>%
rowwise() %>%
summarise(NAs = is.na(a)+is.na(b)+is.na(c)+is.na(d))
I could use the traditional apply approach, like below, but I'd like to solve this using dplyr.
apply(df, 1, function(x)sum(is.na(x)))
Any suggestions as to how to compute the number of missing values, row-wise, efficiently, and using dplyr?
Upvotes: 4
Views: 696
Reputation: 887981
A much faster option is to skip rowwise and c_across altogether and use rowSums instead:
library(dplyr)
df %>%
mutate(NAs = rowSums(is.na(.)))
# a b c d NAs
#1 tt 2 1 tt 0
#2 ss 3 2 ss 0
#3 ss NA NA ss 2
#4 <NA> 1 NA <NA> 3
If we want to count over only certain columns, e.g. the numeric ones, we can select them first:
df %>%
mutate(NAs = rowSums(is.na(select(., where(is.numeric)))))
# a b c d NAs
#1 tt 2 1 tt 0
#2 ss 3 2 ss 0
#3 ss NA NA ss 2
#4 <NA> 1 NA <NA> 1
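The same idea can be written without the `.` placeholder. The sketch below assumes dplyr 1.1.0 or later, where `pick()` is available to grab a set of columns inside `mutate()`; on older versions the `select(., ...)` form above is the one to use. The names `res_all` and `res_num` are only illustrative.

```r
library(dplyr)

df <- data.frame(
  a = c("tt", "ss", "ss", NA),
  b = c(2, 3, NA, 1),
  c = c(1, 2, NA, NA),
  d = c("tt", "ss", "ss", NA)
)

# pick() returns the selected columns as a data frame, so is.na() yields a
# logical matrix and rowSums() counts the TRUE values in each row.
res_all <- df %>%
  mutate(NAs = rowSums(is.na(pick(everything()))))

# Same count, restricted to the numeric columns (b and c).
res_num <- df %>%
  mutate(NAs = rowSums(is.na(pick(where(is.numeric)))))
```

Because `is.na()` is applied to the whole data frame at once, no values of different types are ever combined, which is why this route avoids the c_across error.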
Upvotes: 1
Reputation: 39623
I would suggest this approach. The issue comes from two things: first, the variables in your data frame have different types, and second, a key variable is useful for the rowwise-style task. So, in the code below we first convert the variables to a common type, then create an id based on the row number. We pass that id to rowwise() and can then use c_across(). Here is the code (I have used your df data):
library(tidyverse)
#Code
df %>%
mutate_at(vars(everything()), as.character) %>%  # funs() is deprecated, so pass the function directly
mutate(id=1:n()) %>%
rowwise(id) %>%
mutate(NAs = sum(is.na(c_across(a:d))))
Output:
# A tibble: 4 x 6
# Rowwise: id
a b c d id NAs
<chr> <chr> <chr> <chr> <int> <int>
1 tt 2 1 tt 1 0
2 ss 3 2 ss 2 0
3 ss NA NA ss 3 2
4 NA 1 NA NA 4 3
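The type problem can be reproduced in isolation. The dplyr documentation describes c_across() as combining values with `vctrs::vec_c()`, which is stricter than base `c()`. The snippet below is a small sketch of that behaviour, assuming the vctrs package is installed; the object names are only illustrative.

```r
library(vctrs)

# Combining compatible values works and keeps NA as a missing value.
nums <- vec_c(2, NA, 1)

# A factor and a double have no common type in vctrs, so this errors.
# tryCatch() captures the condition instead of stopping the script.
err <- tryCatch(
  vec_c(factor("tt"), 2),
  error = function(e) e
)

# After converting both values to character, they combine without complaint.
chars <- vec_c(as.character(factor("tt")), as.character(2))
```

This is why converting every column to character before calling c_across() makes the row-wise count work.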
We can also avoid mutate_at() by using the newer across() with mutate() to convert the variables to a common type:
#Code 2
df %>%
mutate(across(a:d,~as.character(.))) %>%
mutate(id=1:n()) %>%
rowwise(id) %>%
mutate(NAs = sum(is.na(c_across(a:d))))
Output:
# A tibble: 4 x 6
# Rowwise: id
a b c d id NAs
<chr> <chr> <chr> <chr> <int> <int>
1 tt 2 1 tt 1 0
2 ss 3 2 ss 2 0
3 ss NA NA ss 3 2
4 NA 1 NA NA 4 3
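As a variation, the id column is mainly helpful for keeping a row identifier (for example when using summarise()). As far as I can tell, rowwise() followed by mutate() also works without it. The sketch below makes that assumption and adds ungroup() at the end so that later steps are no longer evaluated row by row; `res` is just an illustrative name.

```r
library(dplyr)

df <- data.frame(
  a = c("tt", "ss", "ss", NA),
  b = c(2, 3, NA, 1),
  c = c(1, 2, NA, NA),
  d = c("tt", "ss", "ss", NA)
)

res <- df %>%
  # Convert every column to character so c_across() can combine them.
  mutate(across(a:d, as.character)) %>%
  # Evaluate the next mutate() one row at a time.
  rowwise() %>%
  mutate(NAs = sum(is.na(c_across(a:d)))) %>%
  # Drop the row-wise grouping once the count is computed.
  ungroup()
```

Note that this leaves b and c as character columns, so they would need converting back if their numeric values are used afterwards.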
Upvotes: 2