Sarah
Sarah

Reputation: 3519

Remove duplicates in one column based on another column

I'm looking for a nicer way to do this in R. I do have one possibility but it seems like there should be a smart/more readable way.

I want to delete duplicates in one/more column only if a condition is met in another column (or columns).

In my simplified example I want to delete duplicates in column X only if column Y is NA, but keep NA's in Y without a duplicated X.

testDF<- data.frame(X= c(1:4,4:8,8:12), Y = 1:14)
testDF$Y[c(4,6,10)]<- NA

My current solution is:

testDF[!(testDF$X %in% testDF$X[which(duplicated(testDF$X))] & is.na(testDF$Y)),]

or

library(dplyr)

testDF %>% 
    dplyr::filter(!(testDF$X%in% testDF$X[which(duplicated(testDF$X))] & is.na(testDF$Y))) 

which both appear messy and confusing, and in a real application where I am going to be looking at more than two columns could get unworkable.

I was hoping for something more along the lines of:

 testDF %>% dplyr::filter(!(duplicated(X) & is.na(Y))) 

but it duplicated() only identifies the second instance of a duplication so if Y's NA is in line with the first of the duplicated X values then it will not be filtered out.

Preferably looking for a base or tidyverse solution as none of the rest of the script is using data.table

Upvotes: 2

Views: 3086

Answers (3)

DDZR
DDZR

Reputation: 1

I had a similar idea like what @www suggested, arrange X and Y first, and then filter instead of slice.

The first filtering condition would be row_number()==1 , as if row_number == 1 it means X is distinct.
The second filtering condition would be !is.na(Y) , which would only keep any X whose Y is not NA.

testDF %>% arrange(X,Y) %>%  
      group_by(X) %>% filter(row_number()==1 | !is.na(Y)) %>% ungroup()

Upvotes: 0

Melissa Key
Melissa Key

Reputation: 4551

You can also just apply duplicated from both directions:

testDF %>%
  filter(!is.na(Y) | (!duplicated(X) & !duplicated(X, fromLast = TRUE) ))

(highly influenced by this: Find duplicated elements with dplyr - I'll let others decide if this is close enough to be a duplicate)

To make your code even more readable, you can even put this in a function (perhaps with a better function name than mine):

all_duplicates <- function(x) {
  duplicated(x) | duplicated(x, fromLast = TRUE)
}
testDF %>%
  filter(!is.na(Y) | !all_duplicates(X) )

Upvotes: 3

www
www

Reputation: 39174

We can arrange the columns, group by X, and then slice the first row. By doing this, we can get the non-NA row in Y if that row exists.

library(dplyr)

testDF %>%
  arrange(X, Y) %>%
  group_by(X) %>%
  slice(1) %>%
  ungroup()
# # A tibble: 12 x 2
#        X     Y
#    <int> <int>
#  1     1     1
#  2     2     2
#  3     3     3
#  4     4     5
#  5     5    NA
#  6     6     7
#  7     7     8
#  8     8     9
#  9     9    11
# 10    10    12
# 11    11    13
# 12    12    14

Upvotes: 2

Related Questions