Reputation: 441
I would like to delete rows where column a has duplicate values, based on the number of NAs in other columns. This is similar to Deleting rows that are duplicated in one column based on the conditions of another column, but I can't get the NA counting to work with the solutions there.
Here is my toy dataset:
df1 <- data.frame(a = c("x", "y", "y", "z", "x", "z"),
                  b = c(1, 2, NA, 4, 8, 3),
                  c = c(NA, 2, 2, NA, NA, 4),
                  d = c(1:4, NA, NA))
which gives:
  a  b  c  d
1 x  1 NA  1
2 y  2  2  2
3 y NA  2  3
4 z  4 NA  4
5 x  8 NA NA
6 z  3  4 NA
I would like to keep only one row per unique value in column a: the row with the fewest NAs in columns b and c (ignoring NAs in column d).
This is the code I came up with:
library(dplyr)

df1 %>%
  mutate(NAs = apply(is.na(cbind(b, c)), 1, sum)) %>%
  group_by(a) %>%
  top_n(n = 1, -NAs)
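For example, on the toy data this returns four rows, because group x is tied on the NA count:

# A tibble: 4 x 5
# Groups:   a [3]
#   a         b     c     d   NAs
#   <fct> <dbl> <dbl> <int> <dbl>
# 1 x         1    NA     1     1
# 2 y         2     2     2     0
# 3 x         8    NA    NA     1
# 4 z         3     4    NA     0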
My problem is that top_n returns more than one row when there is a tie; in that case I just want the first row. There is probably also a better way to select the columns inside mutate than cbind, and I don't actually need the NAs variable I created with mutate. My desired output is this:
a b  c  d
x 1 NA  1
y 2  2  2
z 3  4 NA
Upvotes: 1
Views: 115
Reputation: 26343
Here is an option
library(dplyr)
df1 %>%
  mutate(NAs = rowSums(is.na(.[, c("b", "c")]))) %>%
  group_by(a) %>%
  top_n(n = 1, -NAs) %>%
  slice(1) %>%
  select(-NAs)
# A tibble: 3 x 4
# Groups: a [3]
# a b c d
# <fct> <dbl> <dbl> <int>
#1 x 1 NA 1
#2 y 2 2 2
#3 z 3 4 NA
rowSums is a more efficient alternative to apply(..., 1, sum), since it is vectorised in C rather than looping over the rows at the R level.
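If you want to check that claim yourself, here is a minimal sketch, assuming you have the microbenchmark package installed (the matrix size and timings are purely illustrative):

library(microbenchmark)

m <- matrix(sample(c(1, NA), 2e6, replace = TRUE), ncol = 2)  # toy matrix with NAs

microbenchmark(
  rowSums = rowSums(is.na(m)),        # vectorised, runs in compiled code
  apply   = apply(is.na(m), 1, sum),  # loops over rows at the R level
  times   = 10
)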
You might also try data.table. The solution below should be really fast (but potentially less readable).
library(data.table)
setDT(df1)
df1[df1[order(a, df1[, rowSums(is.na(.SD)), .SDcols = c("b", "c")]), .I[1], by = "a"]$V1]
# a b c d
#1: x 1 NA 1
#2: y 2 2 2
#3: z 3 4 NA
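If the one-liner is hard to parse, here is the same idea unpacked into smaller steps. This is just a sketch, working on a copy with an explicit helper column, so it is equivalent in result but not byte-for-byte identical:

dt <- copy(df1)                                          # leave df1 untouched
dt[, NAs := rowSums(is.na(.SD)), .SDcols = c("b", "c")]  # NA count per row in b and c
setorder(dt, a, NAs)                                     # within each a, fewest NAs first
res <- dt[, .SD[1], by = a]                              # keep the first row per group
res[, NAs := NULL]                                       # drop the helper column
res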
Upvotes: 2
Reputation: 39858
A slightly different dplyr option:
df1 %>%
  mutate(miss = rowSums(is.na(cbind(b, c)))) %>%
  group_by(a) %>%
  filter(miss == min(miss)) %>%
  slice(1) %>%
  select(-miss) %>%
  ungroup()
Or:
df1 %>%
  mutate(miss = rowSums(is.na(cbind(b, c)))) %>%
  group_by(a) %>%
  mutate(dupl = seq_along(a)) %>%
  filter(miss == min(miss)) %>%
  filter(dupl == min(dupl)) %>%
  select(-miss, -dupl) %>%
  ungroup()
Upvotes: 1
Reputation: 14764
@markus suggested this could be an answer too. Perhaps so: with dplyr it can be helpful to keep the code short, as you can otherwise end up with quite verbose scripts. However, the main part in my opinion is rowSums, as has already been highlighted.
df1 %>%
  arrange(a, rowSums(is.na(.[, c("b", "c")]))) %>%
  distinct(a, .keep_all = TRUE)
a b c d
1 x 1 NA 1
2 y 2 2 2
3 z 3 4 NA
P.S. If speed is your concern, then indeed I'd use as few dplyr verbs as possible: e.g. the approach with only arrange & distinct is about 3x faster than the approaches using group_by, slice, top_n, filter, etc.
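A rough way to check that yourself is a sketch like the one below, again assuming the microbenchmark package; the exact factor will of course vary with data size and hardware:

library(dplyr)
library(microbenchmark)

microbenchmark(
  arrange_distinct = df1 %>%
    arrange(a, rowSums(is.na(.[, c("b", "c")]))) %>%
    distinct(a, .keep_all = TRUE),
  group_slice = df1 %>%
    mutate(NAs = rowSums(is.na(.[, c("b", "c")]))) %>%
    group_by(a) %>%
    top_n(n = 1, -NAs) %>%
    slice(1) %>%
    select(-NAs),
  times = 100
)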
Upvotes: 3