Esther

Reputation: 441

Delete rows that are duplicated in one column, depending on the number of NAs in other columns

I would like to delete rows where column a has duplicate values, based on the number of NAs in other columns. This is similar to the question below, but I can't get the NA counting to work with the solutions given there.

Deleting rows that are duplicated in one column based on the conditions of another column

Here is my toy dataset:

df1 <- data.frame(a = c("x", "y", "y", "z", "x", "z"), b = c(1, 2, NA, 4, 8, 3), c = c(NA, 2, 2, NA, NA, 4), d = c(1:4, NA, NA))

which gives:

  a  b  c  d
1 x  1 NA  1
2 y  2  2  2
3 y NA  2  3
4 z  4 NA  4
5 x  8 NA NA
6 z  3  4 NA

I would like to keep only one row per unique value in column a, choosing the row with the fewest NAs in columns b and c (ignoring NAs in column d).

This is the code I came up with:

library(dplyr)

df1 %>%
  mutate(NAs = apply(is.na(cbind(b, c)), 1, sum)) %>%
  group_by(a) %>%
  top_n(n = 1, -NAs)

My problem is that top_n returns more than one row if there is a tie (demonstrated below the desired output); in case of a tie, I just want the first row returned. There is probably also a better way to select columns inside mutate than cbind, and I don't actually need the NAs variable I created with mutate. My desired output is this:

  a  b  c  d
  x  1 NA  1
  y  2  2  2
  z  3  4 NA
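
To make the tie concrete: in the toy data, both rows with a == "x" have exactly one NA across b and c, which is why my attempt above keeps them both:

df1 %>%
  mutate(NAs = apply(is.na(cbind(b, c)), 1, sum)) %>%
  filter(a == "x")
#   a b  c  d NAs
# 1 x 1 NA  1   1
# 2 x 8 NA NA   1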

Upvotes: 1

Views: 115

Answers (3)

markus

Reputation: 26343

Here is an option

library(dplyr)
df1 %>%
  mutate(NAs = rowSums(is.na(.[, c("b", "c")]))) %>%
  group_by(a) %>%
  top_n(n = 1, -NAs) %>% 
  slice(1) %>% 
  select(-NAs)
# A tibble: 3 x 4
# Groups:   a [3]
#  a         b     c     d
#  <fct> <dbl> <dbl> <int>
#1 x         1    NA     1
#2 y         2     2     2
#3 z         3     4    NA

rowSums is a more efficient alternative to apply(..., 1, sum).
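
If you want to verify that, here is a quick sketch using the microbenchmark package (the matrix size here is made up; relative timings will vary with data size):

library(microbenchmark)

# a larger stand-in for the b/c columns
m <- matrix(sample(c(1:9, NA), 2e5, replace = TRUE), ncol = 2)

microbenchmark(
  apply   = apply(is.na(m), 1, sum),
  rowSums = rowSums(is.na(m)),
  times   = 50
)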


You might also try data.table. The solution below should be really fast (but potentially less readable).

library(data.table)
setDT(df1)
df1[df1[order(a, df1[, rowSums(is.na(.SD)), .SDcols = c("b", "c")]), .I[1], by = "a"]$V1]
#   a b  c  d
#1: x 1 NA  1
#2: y 2  2  2
#3: z 3  4 NA
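
For readability, the one-liner above can be unpacked into steps; this is the same computation spelled out:

library(data.table)
setDT(df1)  # no-op if df1 is already a data.table

# 1. NA count across b and c for every row
na_count <- df1[, rowSums(is.na(.SD)), .SDcols = c("b", "c")]

# 2. order by group, then by NA count, and take the first
#    original row number (.I) per group
first_idx <- df1[order(a, na_count), .I[1], by = "a"]$V1

# 3. subset to those rows
df1[first_idx]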

Upvotes: 2

tmfmnk

Reputation: 39858

A slightly different dplyr option:

df1 %>%
  mutate(miss = rowSums(is.na(cbind(b,c)))) %>%
  group_by(a) %>%
  filter(miss == min(miss)) %>%
  slice(1) %>%
  select(-miss) %>%
  ungroup()

Or:

df1 %>%
  mutate(miss = rowSums(is.na(cbind(b,c)))) %>%
  group_by(a) %>%
  mutate(dupl = seq_along(a)) %>%
  filter(miss == min(miss)) %>%
  filter(dupl == min(dupl)) %>%
  select(-miss, -dupl) %>%
  ungroup()
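
A side note (not from the original answer): since which.min() returns the position of the first minimum, the filter/slice pair could also be collapsed into a single slice() call while keeping the same first-row-on-ties behaviour:

df1 %>%
  mutate(miss = rowSums(is.na(cbind(b, c)))) %>%
  group_by(a) %>%
  slice(which.min(miss)) %>%  # first row among those with the fewest NAs
  select(-miss) %>%
  ungroup()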

Upvotes: 1

arg0naut91

Reputation: 14764

@markus suggested this could be an answer too. Perhaps so: with dplyr it pays to keep the code short, as otherwise you can easily end up with quite verbose scripts.

The main ingredient, however, is rowSums, as has already been highlighted.

df1 %>%
  arrange(a, rowSums(is.na(.[, c("b", "c")]))) %>%
  distinct(a, .keep_all = TRUE)

  a b  c  d
1 x 1 NA  1
2 y 2  2  2
3 z 3  4 NA

P.S. If speed is a concern, then I would indeed try to use as few dplyr verbs as possible: the approach with only arrange and distinct is about 3x faster than the approaches built on group_by, slice, top_n, filter, etc.
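
That claim can be checked with a benchmark along these lines (a sketch: the scaled-up data and the times value are made up here, and absolute numbers will differ per machine):

library(dplyr)
library(microbenchmark)

# scale the toy data up so the timings are measurable
big <- df1[sample(nrow(df1), 1e5, replace = TRUE), ]

microbenchmark(
  arrange_distinct = big %>%
    arrange(a, rowSums(is.na(.[, c("b", "c")]))) %>%
    distinct(a, .keep_all = TRUE),
  top_n_slice = big %>%
    mutate(NAs = rowSums(is.na(.[, c("b", "c")]))) %>%
    group_by(a) %>%
    top_n(n = 1, -NAs) %>%
    slice(1) %>%
    select(-NAs),
  times = 20
)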

Upvotes: 3
