Reputation: 1089
I'm having trouble using rowwise() to count the number of NAs in each row. My minimal example:
df <- data.frame(Q1 = c(rep(1, 1), rep(NA, 9)),
Q2 = c(rep(2, 2), rep(NA, 8)),
Q3 = c(rep(3, 3), rep(NA, 7))
)
df
Q1 Q2 Q3
1 1 2 3
2 NA 2 3
3 NA NA 3
4 NA NA NA
5 NA NA NA
6 NA NA NA
7 NA NA NA
8 NA NA NA
9 NA NA NA
10 NA NA NA
I would like to create a new column that counts the number of NAs in each row. I can do this very simply by writing
df$Count_NA <- rowSums(is.na(df))
df
Q1 Q2 Q3 Count_NA
1 1 2 3 0
2 NA 2 3 1
3 NA NA 3 2
4 NA NA NA 3
5 NA NA NA 3
6 NA NA NA 3
7 NA NA NA 3
8 NA NA NA 3
9 NA NA NA 3
10 NA NA NA 3
But if I try and do this via dplyr using rowwise(), I get the wrong answer - the column Count_NA has the same number in each row:
df %>%
rowwise() %>%
mutate(Count_NA = sum(is.na(.)))
# A tibble: 10 x 4
# Rowwise:
Q1 Q2 Q3 Count_NA
<dbl> <dbl> <dbl> <int>
1 1 2 3 24
2 NA 2 3 24
3 NA NA 3 24
4 NA NA NA 24
5 NA NA NA 24
6 NA NA NA 24
7 NA NA NA 24
8 NA NA NA 24
9 NA NA NA 24
10 NA NA NA 24
what am I doing wrong, and how do i fix this?
Many thanks in advance
Thomas Philips
Upvotes: 8
Views: 2590
Reputation: 1148
This might be late, however, an alternate way to address your problem where, for example, you needed to calculate NAs over different sets of columns would be like this:
# You can specify different column sets
cols_to_check1 <- c("Q1", "Q2", "Q3")
cols_to_check2 <- c("Q1", "Q2")
Then, you apply the rowSums
function with mutate()
df %>%
mutate(count_NA1 = rowSums(across(cols_to_check1, ~ is.na(.))),
count_NA2 = rowSums(across(cols_to_check2, ~ is.na(.))))
The solution for this would be as below:
Q1 Q2 Q3 count_NA1 count_NA2
1 1 2 3 0 0
2 NA 2 3 1 1
3 NA NA 3 2 2
4 NA NA NA 3 2
5 NA NA NA 3 2
6 NA NA NA 3 2
7 NA NA NA 3 2
8 NA NA NA 3 2
9 NA NA NA 3 2
10 NA NA NA 3 2
Upvotes: 1
Reputation: 886948
Using dapply
library(collapse)
dapply(df, function(x) sum(is.na(x)), MARGIN = 1)
#[1] 0 1 2 3 3 3 3 3 3 3
Upvotes: 3
Reputation: 41210
rowSums
works directly with mutate
without rowwise
:
df %>% mutate(count_NA = rowSums(is.na(.)))
Q1 Q2 Q3 count_NA
1 1 2 3 0
2 NA 2 3 1
3 NA NA 3 2
4 NA NA NA 3
5 NA NA NA 3
6 NA NA NA 3
7 NA NA NA 3
8 NA NA NA 3
9 NA NA NA 3
10 NA NA NA 3
Note that your initial solution is by far the fastest one:
microbenchmark::microbenchmark(
df$Count_NA <- rowSums(is.na(df)),
df$Count_NA <- apply(df, 1, function(x) sum(is.na(x))),
df %>% mutate(count_NA = rowSums(is.na(.))),
df %>%
mutate(Count_NA = purrr::pmap(., ~ sum(is.na(c(...))))),
df %>%
rowwise() %>%
mutate(a=sum(is.na(c_across(everything())))),
df %>%
rowwise() %>%
mutate(Count_NA = sum(is.na(cur_data()))) %>%
ungroup
)
Unit: microseconds
expr min lq
df$Count_NA <- rowSums(is.na(df)) 39.8 64.30
df$Count_NA <- apply(df, 1, function(x) sum(is.na(x))) 1661.6 1868.40
df %>% mutate(count_NA = rowSums(is.na(.))) 1181.7 1572.80
df %>% mutate(Count_NA = purrr::pmap(., ~sum(is.na(c(...))))) 4749.9 5190.35
df %>% rowwise() %>% mutate(a = sum(is.na(c_across(everything())))) 29124.1 31148.50
df %>% rowwise() %>% mutate(Count_NA = sum(is.na(cur_data()))) %>% ungroup 70473.0 73659.70
mean median uq max neval cld
79.033 76.25 88.75 174.0 100 a
2082.960 1966.50 2075.75 8777.3 100 b
1722.178 1676.20 1791.60 3112.9 100 b
5726.549 5396.40 5745.25 28592.1 100 c
33567.825 31983.05 33637.00 54676.9 100 d
77902.342 76492.85 81199.15 98942.1 100 e
Unit: microseconds
expr min lq
df$Count_NA <- rowSums(is.na(df)) 38.2 44.95
df$Count_NA <- apply(df, 1, function(x) sum(is.na(x))) 1584.8 1765.30
df %>% mutate(count_NA = rowSums(is.na(.))) 1247.9 1496.95
df %>% mutate(Count_NA = purrr::pmap(., ~sum(is.na(c(...))))) 4614.0 5110.50
df %>% rowwise() %>% mutate(Count_NA = sum(is.na(cur_data()))) %>% ungroup 67413.5 70865.45
mean median uq max neval cld
71.159 65.85 84.40 162.2 100 a
1967.629 1894.45 2093.30 3436.6 100 ab
1814.193 1666.25 1895.35 9031.0 100 a
5796.483 5380.70 5665.10 15309.7 100 b
78309.807 75275.30 79776.40 286964.3 100 c
Upvotes: 8
Reputation: 269451
Use cur_data()
rather than dot. .[cur_group_id(), ]
, c(Q1, Q2, Q3)
, across()
or c_across()
(or c_across
with the argument as per other answer) would also work.
Note that it is best to use ungroup
afterwards or else it will retain the memory of the rowwise and you might get unexpected results later on.
df %>%
rowwise() %>%
mutate(Count_NA = sum(is.na(cur_data()))) %>%
ungroup
giving:
# A tibble: 10 x 4
Q1 Q2 Q3 Count_NA
<dbl> <dbl> <dbl> <int>
1 1 2 3 0
2 NA 2 3 1
3 NA NA 3 2
4 NA NA NA 3
5 NA NA NA 3
6 NA NA NA 3
7 NA NA NA 3
8 NA NA NA 3
9 NA NA NA 3
10 NA NA NA 3
Upvotes: 9
Reputation: 26218
baseR answer
df$Count_NA <- apply(df, 1, function(x) sum(is.na(x)))
df
Q1 Q2 Q3 Count_NA
1 1 2 3 0
2 NA 2 3 1
3 NA NA 3 2
4 NA NA NA 3
5 NA NA NA 3
6 NA NA NA 3
7 NA NA NA 3
8 NA NA NA 3
9 NA NA NA 3
10 NA NA NA 3
So can be integrated into dplyr pipe
df %>% mutate(count_NA = apply(., 1, function(x) sum(is.na(x))))
Q1 Q2 Q3 count_NA
1 1 2 3 0
2 NA 2 3 1
3 NA NA 3 2
4 NA NA NA 3
5 NA NA NA 3
6 NA NA NA 3
7 NA NA NA 3
8 NA NA NA 3
9 NA NA NA 3
10 NA NA NA 3
Upvotes: 5
Reputation: 21908
In case future you were interested in a row-wise solution with purrr
package functions:
library(purrr)
df %>%
mutate(Count_NA = pmap(., ~ sum(is.na(c(...)))))
Q1 Q2 Q3 Count_NA
1 1 2 3 0
2 NA 2 3 1
3 NA NA 3 2
4 NA NA NA 3
5 NA NA NA 3
6 NA NA NA 3
7 NA NA NA 3
8 NA NA NA 3
9 NA NA NA 3
10 NA NA NA 3
Upvotes: 4
Reputation: 160407
One issue is that .
here resolves to the whole frame, not just the whole row.
Another dplyr method, using c_across
:
df %>%
rowwise() %>%
mutate(a=sum(is.na(c_across(everything()))))
# # A tibble: 10 x 4
# # Rowwise:
# Q1 Q2 Q3 a
# <dbl> <dbl> <dbl> <int>
# 1 1 2 3 0
# 2 NA 2 3 1
# 3 NA NA 3 2
# 4 NA NA NA 3
# 5 NA NA NA 3
# 6 NA NA NA 3
# 7 NA NA NA 3
# 8 NA NA NA 3
# 9 NA NA NA 3
# 10 NA NA NA 3
The biggest difference I can see between using this and cur_data()
is that c_across
allows for variable-selection a little more directly, as in c_across(starts_with("Q"))
. Granted, one could always select(cur_data(),...)
, so this is a weak argument.
Upvotes: 5