Reputation: 23
New to R and to programming. This might be an easy question. I'm trying to find duplicate elements in certain pairs of columns, and replace both the original and the duplicate with N/A. So if I have the following dataset:
mydf <- structure(list(V1 = c(1, 2, 3, 1, 3, 2), V2 = c("zz", "aa", "bb", "zz", "yy",
"ii"), V3 = c("aa", "ff", "aa", "hh", "cc", "jj"), V4 = c("ee",
"xx", "ee", "hh", "dd", "kk"), V5 = c(213L, 254L, 235L, 356L,
796L, 954L)), class = "data.frame", row.names = c(NA, -6L))
  V1 V2 V3 V4  V5
1  1 zz aa ee 213
2  2 aa ff xx 254
3  3 bb aa ee 235
4  1 zz hh hh 356
5  3 yy cc dd 796
6  2 ii jj kk 954
I'd like to find rows that are duplicated either in the pair V1 and V2, or in the pair V3 and V4, and blank out only the pair that matched. So the final result would look like this:
   V1  V2  V3  V4  V5
1 N/A N/A N/A N/A 213
2   2  aa  ff  xx 254
3   3  bb N/A N/A 235
4 N/A N/A  hh  hh 356
5   3  yy  cc  dd 796
6   2  ii  jj  kk 954
Upvotes: 1
Views: 108
Reputation: 388807
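You can check for duplicated rows within each pair of columns and replace them with NA. duplicated() on its own flags only the later copies, so it is combined with a second pass using fromLast = TRUE to flag the first occurrence as well.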
cols1 <- c('V1', 'V2')
cols2 <- c('V3', 'V4')
mydf[cols1][duplicated(mydf[cols1]) | duplicated(mydf[cols1], fromLast = TRUE),] <- NA
mydf[cols2][duplicated(mydf[cols2]) | duplicated(mydf[cols2], fromLast = TRUE),] <- NA
mydf
# V1 V2 V3 V4 V5
#1 NA <NA> <NA> <NA> 213
#2 2 aa ff xx 254
#3 3 bb <NA> <NA> 235
#4 NA <NA> hh hh 356
#5 3 yy cc dd 796
#6 2 ii jj kk 954
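If more column groups need the same treatment, the two assignments can be wrapped in a small function. This is only a sketch, not part of the answer above: the name na_duplicated_groups and its groups argument are made up for illustration. It computes every duplicate mask from the untouched input before blanking anything, which is an added design choice so that overlapping groups cannot influence each other. It also assumes that existing NA values should count as equal to each other, which is how duplicated() treats them.

```r
# Hypothetical helper (name and arguments are illustrative, not from the
# answer): for each group of columns, set every row that has a twin within
# that group to NA.
na_duplicated_groups <- function(df, groups) {
  # Build all masks from the original data first, so that blanking one
  # group cannot change the duplicate check for another group.
  masks <- lapply(groups, function(cols) {
    sub <- df[cols]
    # duplicated() marks only later copies; the fromLast = TRUE pass marks
    # the earlier ones, so the union covers every member of a match.
    duplicated(sub) | duplicated(sub, fromLast = TRUE)
  })
  for (i in seq_along(groups)) {
    df[masks[[i]], groups[[i]]] <- NA
  }
  df
}

# Same data as in the question.
mydf <- data.frame(
  V1 = c(1, 2, 3, 1, 3, 2),
  V2 = c("zz", "aa", "bb", "zz", "yy", "ii"),
  V3 = c("aa", "ff", "aa", "hh", "cc", "jj"),
  V4 = c("ee", "xx", "ee", "hh", "dd", "kk"),
  V5 = c(213L, 254L, 235L, 356L, 796L, 954L),
  stringsAsFactors = FALSE
)

result <- na_duplicated_groups(mydf, list(c("V1", "V2"), c("V3", "V4")))
print(result)
```

The function returns a modified copy, so mydf itself is left unchanged; assign the result back if you want to overwrite it.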
Upvotes: 0