Reputation: 23
I am trying to get a new data frame with all the recurring elements across a number of selected columns. So, for example, if I have the following df I'd like to run a search for duplicate elements by columns V2, V3, V4.
V1 V2 V3 V4 V5
1 zz aa ee 213
2 aa ff xx 254
3 bb hh gg 235
4 zz hh hh 356
5 yy cc dd 796
6 ii jj kk 954
I would like to get all the rows that contained at least an element that is a duplicate from another of the selected columns or rows. Moreover, the duplicate function in R only select the duplicates, leaving out the 'originals', which I'd also like to include.
So the final df would look like this:
V1 V2 V3 V4 V5
1 zz aa ee 213
2 aa ff xx 254
3 bb hh gg 235
4 zz hh hh 356
Upvotes: 1
Views: 162
Reputation: 23574
If I am reading your question correctly, the following is one way to go. Your data is called mydf
. I chose columns that have character. (You may have factors.) Then, I created a character vector and checked which elements have duplication and obtained unique values, which is called dupes
. Using this, I subsetted mydf
. If any element in V2, V3, and V4 contains one of the elements in dupes
, R returned rows.
library(dplyr)
unlist(select_if(mydf, is.character)) %>%
.[duplicated(.)] %>%
unique -> dupes
filter_at(mydf, vars(V2:V4), any_vars(. %in% dupes))
# V1 V2 V3 V4 V5
#1 1 zz aa ee 213
#2 2 aa ff xx 254
#3 3 bb hh gg 235
#4 4 zz hh hh 356
DATA
mydf <- structure(list(V1 = 1:6, V2 = c("zz", "aa", "bb", "zz", "yy",
"ii"), V3 = c("aa", "ff", "hh", "hh", "cc", "jj"), V4 = c("ee",
"xx", "gg", "hh", "dd", "kk"), V5 = c(213L, 254L, 235L, 356L,
796L, 954L)), class = "data.frame", row.names = c(NA, -6L))
Upvotes: 3
Reputation: 39657
A variant @Ronak-Shah's solution using colSums
and apply
instead of Reduce
and lapply
.
cols <- c('V2', 'V3', 'V4')
vals <- unlist(df[cols])
vals <- unique(vals[duplicated(vals)])
df[colSums(apply(df[cols], 1, "%in%", vals)) > 0,]
# V1 V2 V3 V4 V5
#1 1 zz aa ee 213
#2 2 aa ff xx 254
#3 3 bb hh gg 235
#4 4 zz hh hh 356
Upvotes: 1
Reputation: 101343
Here is a base R solution
dfout <- subset(df,apply(df[2:4], 1, function(x) any(x %in% unique((z<-unlist(df[2:4]))[duplicated(z)]))))
such that
> dfout
V1 V2 V3 V4 V5
1 1 zz aa ee 213
2 2 aa ff xx 254
3 3 bb hh gg 235
4 4 zz hh hh 356
DATA
df <- structure(list(V1 = 1:6, V2 = c("zz", "aa", "bb", "zz", "yy",
"ii"), V3 = c("aa", "ff", "hh", "hh", "cc", "jj"), V4 = c("ee",
"xx", "gg", "hh", "dd", "kk"), V5 = c(213L, 254L, 235L, 356L,
796L, 954L)), class = "data.frame", row.names = c(NA, -6L))
Upvotes: 0
Reputation: 388982
Using base R, we can unlist
the selected columns (cols
), get the duplicated
values and select rows that have those values.
cols <- c('V2', 'V3', 'V4')
vals <- unlist(df[cols])
df[Reduce(`|`, lapply(df[cols], `%in%`, vals[duplicated(vals)])), ]
# V1 V2 V3 V4 V5
#1 1 zz aa ee 213
#2 2 aa ff xx 254
#3 3 bb hh gg 235
#4 4 zz hh hh 356
data
df <- structure(list(V1 = 1:6, V2 = structure(c(5L, 1L, 2L, 5L, 4L,
3L), .Label = c("aa", "bb", "ii", "yy", "zz"), class = "factor"),
V3 = structure(c(1L, 3L, 4L, 4L, 2L, 5L), .Label = c("aa",
"cc", "ff", "hh", "jj"), class = "factor"), V4 = structure(c(2L,
6L, 3L, 4L, 1L, 5L), .Label = c("dd", "ee", "gg", "hh", "kk",
"xx"), class = "factor"), V5 = c(213L, 254L, 235L, 356L,
796L, 954L)), class = "data.frame", row.names = c(NA, -6L))
Upvotes: 2