Reputation: 233
I want to check which rows of the matrix or data frame are duplicates. How can we find them?
We want to remove the duplicate rows. Duplicate rows are rows that have the same values in columns 1 and 2, ignoring their order.
For example, for the following matrix:
Col1 Col2 database
A    B    IntAct
A    B    Bind
B    A    BioGrid
I want to keep only one of these rows:
Col1 Col2 database
A    B    IntAct
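For reference, the example can be set up as a data frame like this (a sketch: the name df is an assumption, and the answers below also refer to it as df1 or d):
df <- data.frame(Col1     = c("A", "A", "B"),
                 Col2     = c("B", "B", "A"),
                 database = c("IntAct", "Bind", "BioGrid"),
                 stringsAsFactors = FALSE)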
Upvotes: 2
Views: 895
Reputation: 13570
A wordy alternative using sqldf:
First row
We create an id column so that we can keep the first occurrence of each pair; the inner query swaps col1 and col2 where needed so that every pair is compared in a canonical order before grouping.
library(sqldf)
df$id <- seq.int(nrow(df))
sqldf("select col1, col2, database, min(id) id
       from (select col1, col2, database, id from df where col1 <= col2
             union all
             select col2 col1, col1 col2, database, id from df where col1 > col2)
       group by col1, col2")
Output:
  col1 col2 database id
1    A    B   IntAct  1
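For reference, the inner query is what canonicalises the pairs (it swaps the two columns whenever col1 > col2), so on the example data it should return something like:
sqldf("select col1, col2, database, id from df where col1 <= col2
       union all
       select col2 col1, col1 col2, database, id from df where col1 > col2")
#   col1 col2 database id
# 1    A    B   IntAct  1
# 2    A    B     Bind  2
# 3    A    B  BioGrid  3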
Last row (third in this example)
A more concise option, proposed by G. Grothendieck:
sqldf("select col1, col2, database
from (select col1, col2, database from df where col1 <= col2
union all
select col2 col1, col1 col2, database from df where col1 > col2)
group by col1, col2")
Output:
  col1 col2 database
1    A    B  BioGrid
Upvotes: 4
Reputation: 886938
Here is another option using pmin/pmax:
library(data.table)
setDT(df1)[!duplicated(cbind(pmin(Col1, Col2), pmax(Col1, Col2)))]
#    Col1 Col2 database
# 1:    A    B   IntAct
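To see why this works, here is what pmin/pmax give for the example data (a sketch, assuming the question's data frame; it is called df elsewhere in this thread):
pmin(df$Col1, df$Col2)
# [1] "A" "A" "A"
pmax(df$Col1, df$Col2)
# [1] "B" "B" "B"
# The (pmin, pmax) pair is the same whichever way round the columns are,
# so duplicated() treats "A B" and "B A" rows as the same key.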
Benchmarking with bigger data:
# dummy data
set.seed(123)
df <- data.frame(Col1 = sample(c("A", "B", "C"), 1000, replace = TRUE),
                 Col2 = sample(c("A", "B", "C"), 1000, replace = TRUE),
                 database = sample(c("IntAct", "Bind", "BioGrid"), 1000,
                                   replace = TRUE),
                 stringsAsFactors = FALSE)
# benchmark
microbenchmark::microbenchmark(
  t     = df[!duplicated(t(apply(df[, 1:2], 1, sort))), ],
  paste = df[!duplicated(apply(df[, 1:2], 1,
                               function(i) paste(sort(i), collapse = ","))), ],
  pmin  = df[!duplicated(cbind(pmin(df[, 1], df[, 2]), pmax(df[, 1], df[, 2]))), ],
  times = 1000)
# Unit: milliseconds
#   expr      min        lq      mean    median        uq       max neval cld
#      t 33.49008 36.337253 38.374825 37.420015 39.610627 153.89251  1000   b
#  paste 33.24177 36.102055 38.079015 37.330498 39.465803 151.43734  1000   b
#   pmin  2.59116  2.790864  3.034999  2.910316  3.137389  11.99905  1000   a
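As a quick sanity check (a sketch, using the dummy df generated above; keep_t and keep_pmin are illustrative names), the timed approaches should keep exactly the same rows:
keep_t    <- df[!duplicated(t(apply(df[, 1:2], 1, sort))), ]
keep_pmin <- df[!duplicated(cbind(pmin(df[, 1], df[, 2]), pmax(df[, 1], df[, 2]))), ]
identical(rownames(keep_t), rownames(keep_pmin))
# expected: TRUE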
Upvotes: 5
Reputation: 56004
Paste the columns together with a separator, then use duplicated:
df[!duplicated(apply(df[, 1:2], 1, function(i) paste(sort(i), collapse = ","))), ]
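On the question's example data (again assuming it is in df), the generated keys look like this:
apply(df[, 1:2], 1, function(i) paste(sort(i), collapse = ","))
# [1] "A,B" "A,B" "A,B"
# identical keys, so duplicated() keeps only the first row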
Upvotes: 3
Reputation: 42629
d[!(duplicated(d[,1:2]) | rev(duplicated(d[rev(rownames(d)), 1:2]))),]
  Col1 Col2 database
3    B    A  BioGrid
duplicated flags rows that match a row with a lower index. That's not quite enough on its own, but applying it to the data frame both top-to-bottom and bottom-to-top gives what you want.
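Step by step on the example data (a sketch, assuming it is stored in d):
duplicated(d[, 1:2])
# [1] FALSE  TRUE FALSE   <- top-to-bottom pass
rev(duplicated(d[rev(rownames(d)), 1:2]))
# [1]  TRUE FALSE FALSE   <- bottom-to-top pass
# OR-ing the two flags every row of Col1/Col2 that has an exact duplicate,
# so negating keeps only the rows that appear exactly once.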
Upvotes: 3