Jakab Zalán
Jakab Zalán

Reputation: 23

Find duplicate elements across rows by selected columns in R

I am trying to get a new data frame with all the recurring elements across a number of selected columns. So, for example, if I have the following df I'd like to run a search for duplicate elements by columns V2, V3, V4.

V1  V2  V3  V4  V5
1   zz  aa  ee  213
2   aa  ff  xx  254
3   bb  hh  gg  235
4   zz  hh  hh  356
5   yy  cc  dd  796
6   ii  jj  kk  954

I would like to get all the rows that contained at least an element that is a duplicate from another of the selected columns or rows. Moreover, the duplicate function in R only select the duplicates, leaving out the 'originals', which I'd also like to include.

So the final df would look like this:

V1  V2  V3  V4  V5
1   zz  aa  ee  213
2   aa  ff  xx  254
3   bb  hh  gg  235
4   zz  hh  hh  356

Upvotes: 1

Views: 162

Answers (4)

jazzurro
jazzurro

Reputation: 23574

If I am reading your question correctly, the following is one way to go. Your data is called mydf. I chose columns that have character. (You may have factors.) Then, I created a character vector and checked which elements have duplication and obtained unique values, which is called dupes. Using this, I subsetted mydf. If any element in V2, V3, and V4 contains one of the elements in dupes, R returned rows.

library(dplyr)

unlist(select_if(mydf, is.character)) %>%
.[duplicated(.)] %>% 
unique -> dupes

filter_at(mydf, vars(V2:V4), any_vars(. %in% dupes))

#  V1 V2 V3 V4  V5
#1  1 zz aa ee 213
#2  2 aa ff xx 254
#3  3 bb hh gg 235
#4  4 zz hh hh 356

DATA

mydf <- structure(list(V1 = 1:6, V2 = c("zz", "aa", "bb", "zz", "yy", 
"ii"), V3 = c("aa", "ff", "hh", "hh", "cc", "jj"), V4 = c("ee", 
"xx", "gg", "hh", "dd", "kk"), V5 = c(213L, 254L, 235L, 356L, 
796L, 954L)), class = "data.frame", row.names = c(NA, -6L))

Upvotes: 3

GKi
GKi

Reputation: 39657

A variant @Ronak-Shah's solution using colSums and apply instead of Reduce and lapply.

cols <- c('V2', 'V3', 'V4')
vals <- unlist(df[cols])
vals <- unique(vals[duplicated(vals)])
df[colSums(apply(df[cols], 1, "%in%", vals)) > 0,]
#  V1 V2 V3 V4  V5
#1  1 zz aa ee 213
#2  2 aa ff xx 254
#3  3 bb hh gg 235
#4  4 zz hh hh 356

Upvotes: 1

ThomasIsCoding
ThomasIsCoding

Reputation: 101343

Here is a base R solution

dfout <- subset(df,apply(df[2:4], 1, function(x) any(x %in% unique((z<-unlist(df[2:4]))[duplicated(z)]))))

such that

> dfout
  V1 V2 V3 V4  V5
1  1 zz aa ee 213
2  2 aa ff xx 254
3  3 bb hh gg 235
4  4 zz hh hh 356

DATA

df <- structure(list(V1 = 1:6, V2 = c("zz", "aa", "bb", "zz", "yy", 
"ii"), V3 = c("aa", "ff", "hh", "hh", "cc", "jj"), V4 = c("ee", 
"xx", "gg", "hh", "dd", "kk"), V5 = c(213L, 254L, 235L, 356L, 
796L, 954L)), class = "data.frame", row.names = c(NA, -6L))

Upvotes: 0

Ronak Shah
Ronak Shah

Reputation: 388982

Using base R, we can unlist the selected columns (cols), get the duplicated values and select rows that have those values.

cols <- c('V2', 'V3', 'V4')

vals <- unlist(df[cols])
df[Reduce(`|`, lapply(df[cols], `%in%`, vals[duplicated(vals)])), ]

#  V1 V2 V3 V4  V5
#1  1 zz aa ee 213
#2  2 aa ff xx 254
#3  3 bb hh gg 235
#4  4 zz hh hh 356

data

df <- structure(list(V1 = 1:6, V2 = structure(c(5L, 1L, 2L, 5L, 4L, 
3L), .Label = c("aa", "bb", "ii", "yy", "zz"), class = "factor"), 
V3 = structure(c(1L, 3L, 4L, 4L, 2L, 5L), .Label = c("aa", 
"cc", "ff", "hh", "jj"), class = "factor"), V4 = structure(c(2L, 
6L, 3L, 4L, 1L, 5L), .Label = c("dd", "ee", "gg", "hh", "kk", 
"xx"), class = "factor"), V5 = c(213L, 254L, 235L, 356L, 
796L, 954L)), class = "data.frame", row.names = c(NA, -6L))

Upvotes: 2

Related Questions