Jakab Zalán

Reputation: 23

Finding duplicate elements in multiple pairs of columns in R

I'm new to R and to programming, so this might be an easy question. I'm trying to find duplicated values in certain pairs of columns and replace both the original and the duplicate with N/A. Given the following dataset:

mydf <- structure(list(V1 = c(1, 2, 3, 1, 3, 2), V2 = c("zz", "aa", "bb", "zz", "yy", 
"ii"), V3 = c("aa", "ff", "aa", "hh", "cc", "jj"), V4 = c("ee", 
"xx", "ee", "hh", "dd", "kk"), V5 = c(213L, 254L, 235L, 356L, 
796L, 954L)), class = "data.frame", row.names = c(NA, -6L))

  V1 V2 V3 V4  V5
1  1 zz aa ee 213
2  2 aa ff xx 254
3  3 bb aa ee 235
4  1 zz hh hh 356
5  3 yy cc dd 796
6  2 ii jj kk 954

I'd like to find rows that are duplicated either in the V1/V2 pair or in the V3/V4 pair, so the final result would look like this:

    V1   V2   V3   V4  V5
1   N/A  N/A  N/A  N/A 213
2    2   aa   ff   xx  254
3    3   bb   N/A  N/A 235
4   N/A  N/A  hh   hh  356
5    3   yy   cc   dd  796
6    2   ii   jj   kk  954

Upvotes: 1

Views: 108

Answers (1)

Ronak Shah

Reputation: 388807

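You can check for duplicated rows within each set of columns and replace them with NA. Combining duplicated() with duplicated(..., fromLast = TRUE) flags every occurrence of a repeated row, including the first one.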

cols1 <- c('V1', 'V2')
cols2 <- c('V3', 'V4')

mydf[cols1][duplicated(mydf[cols1]) | duplicated(mydf[cols1], fromLast = TRUE),] <- NA
mydf[cols2][duplicated(mydf[cols2]) | duplicated(mydf[cols2], fromLast = TRUE),] <- NA

mydf
#  V1   V2   V3   V4  V5
#1 NA <NA> <NA> <NA> 213
#2  2   aa   ff   xx 254
#3  3   bb <NA> <NA> 235
#4 NA <NA>   hh   hh 356
#5  3   yy   cc   dd 796
#6  2   ii   jj   kk 954
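
If there are more than two column pairs, the same idea can be wrapped in a small function and applied to each pair in turn. The following is only a sketch: `na_duplicated` and `col_groups` are hypothetical names introduced here, not part of the answer above, and it assumes the column groups do not overlap.

```r
# Hypothetical helper: set `cols` to NA in every row whose combination of
# values in `cols` occurs more than once in the data frame.
na_duplicated <- function(df, cols) {
  key <- df[cols]
  # TRUE for every member of a duplicate group, the first occurrence included
  dup <- duplicated(key) | duplicated(key, fromLast = TRUE)
  df[dup, cols] <- NA
  df
}

# Same data as in the question
mydf <- data.frame(
  V1 = c(1, 2, 3, 1, 3, 2),
  V2 = c("zz", "aa", "bb", "zz", "yy", "ii"),
  V3 = c("aa", "ff", "aa", "hh", "cc", "jj"),
  V4 = c("ee", "xx", "ee", "hh", "dd", "kk"),
  V5 = c(213L, 254L, 235L, 356L, 796L, 954L),
  stringsAsFactors = FALSE
)

# Each element is one group of columns to check together (assumed non-overlapping)
col_groups <- list(c("V1", "V2"), c("V3", "V4"))

# Apply the helper to each group in turn, starting from mydf
result <- Reduce(na_duplicated, col_groups, mydf)
result
```

Here `Reduce()` passes the data frame through `na_duplicated()` once per group, so adding another pair only means adding another element to `col_groups`. On the question's data this should blank V1/V2 in rows 1 and 4 and V3/V4 in rows 1 and 3, leaving V5 untouched.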

Upvotes: 0
