Reputation: 3519
I'm looking for a nicer way to do this in R. I do have one possibility but it seems like there should be a smart/more readable way.
I want to delete duplicates in one/more column only if a condition is met in another column (or columns).
In my simplified example I want to delete duplicates in column X
only if column Y
is NA
, but keep NA
's in Y without a duplicated X.
testDF<- data.frame(X= c(1:4,4:8,8:12), Y = 1:14)
testDF$Y[c(4,6,10)]<- NA
My current solution is:
testDF[!(testDF$X %in% testDF$X[which(duplicated(testDF$X))] & is.na(testDF$Y)),]
or
library(dplyr)
testDF %>%
dplyr::filter(!(testDF$X%in% testDF$X[which(duplicated(testDF$X))] & is.na(testDF$Y)))
which both appear messy and confusing, and in a real application where I am going to be looking at more than two columns could get unworkable.
I was hoping for something more along the lines of:
testDF %>% dplyr::filter(!(duplicated(X) & is.na(Y)))
but it duplicated()
only identifies the second instance of a duplication so if Y
's NA
is in line with the first of the duplicated X values then it will not be filtered out.
Preferably looking for a base or tidyverse solution as none of the rest of the script is using data.table
Upvotes: 2
Views: 3086
Reputation: 1
I had a similar idea like what @www suggested, arrange X and Y first, and then filter
instead of slice
.
The first filtering condition would be row_number()==1
, as if row_number == 1
it means X is distinct.
The second filtering condition would be !is.na(Y)
, which would only keep any X whose Y is not NA.
testDF %>% arrange(X,Y) %>%
group_by(X) %>% filter(row_number()==1 | !is.na(Y)) %>% ungroup()
Upvotes: 0
Reputation: 4551
You can also just apply duplicated
from both directions:
testDF %>%
filter(!is.na(Y) | (!duplicated(X) & !duplicated(X, fromLast = TRUE) ))
(highly influenced by this: Find duplicated elements with dplyr - I'll let others decide if this is close enough to be a duplicate)
To make your code even more readable, you can even put this in a function (perhaps with a better function name than mine):
all_duplicates <- function(x) {
duplicated(x) | duplicated(x, fromLast = TRUE)
}
testDF %>%
filter(!is.na(Y) | !all_duplicates(X) )
Upvotes: 3
Reputation: 39174
We can arrange the columns, group by X
, and then slice
the first row. By doing this, we can get the non-NA row in Y
if that row exists.
library(dplyr)
testDF %>%
arrange(X, Y) %>%
group_by(X) %>%
slice(1) %>%
ungroup()
# # A tibble: 12 x 2
# X Y
# <int> <int>
# 1 1 1
# 2 2 2
# 3 3 3
# 4 4 5
# 5 5 NA
# 6 6 7
# 7 7 8
# 8 8 9
# 9 9 11
# 10 10 12
# 11 11 13
# 12 12 14
Upvotes: 2