Zaynab

Reputation: 233

Remove duplicate rows of a matrix or dataframe

I want to check which rows of a matrix or data frame are duplicates. How can I find them?

I want to remove duplicate rows. Duplicate rows are rows that have the same values in columns 1 and 2, ignoring their order.

For example, for the following matrix:

Col1   Col2     database 
 A       B       IntAct
 A       B       Bind
 B       A       BioGrid

I want to keep only one of these rows:

Col1   Col2     database 
 A       B       IntAct
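
For reference, the example data can be built like this (a minimal sketch; the name df and the use of character columns rather than factors are assumptions, but the answers below assume something similar):

df <- data.frame(Col1     = c("A", "A", "B"),
                 Col2     = c("B", "B", "A"),
                 database = c("IntAct", "Bind", "BioGrid"),
                 stringsAsFactors = FALSE)  # keep Col1/Col2 as character, not factor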

Upvotes: 2

Views: 895

Answers (4)

mpalanco

Reputation: 13570

A wordy alternative using sqldf:

First row
We create an id column to get the first occurrence.

library(sqldf)
df$id <- seq.int(nrow(df))
sqldf("select col1, col2, database, min(id) id
      from (select col1, col2, database, id from df where col1 <= col2
      union all
      select col2 col1, col1 col2, database, id from df where col1 > col2) 
      group by col1, col2")

Output:

  col1 col2 database id
1    A    B   IntAct  1

Last row (third in this example)
A more concise option, proposed by G. Grothendieck:

sqldf("select col1, col2, database 
      from (select col1, col2, database from df where col1 <= col2 
      union all 
      select col2 col1, col1 col2, database from df where col1 > col2) 
      group by col1, col2")

Output:

  col1 col2 database
1    A    B  BioGrid

Upvotes: 4

akrun

Reputation: 886938

Here is another option using pmin/pmax:

library(data.table)
# order each pair so the smaller value comes first, then drop duplicated pairs
setDT(df1)[!duplicated(cbind(pmin(Col1, Col2), pmax(Col1, Col2)))]
#   Col1 Col2 database
#1:    A    B   IntAct
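
To see what pmin/pmax are doing here, a quick sketch (assuming df1 holds the question's example data with character columns):

df1 <- data.frame(Col1 = c("A", "A", "B"),
                  Col2 = c("B", "B", "A"),
                  database = c("IntAct", "Bind", "BioGrid"),
                  stringsAsFactors = FALSE)

pmin(df1$Col1, df1$Col2)   # "A" "A" "A" -- smaller element of each pair
pmax(df1$Col1, df1$Col2)   # "B" "B" "B" -- larger element of each pair

# cbind() combines them into a two-column key, so A/B and B/A map to the
# same key and duplicated() flags the second and third rows
duplicated(cbind(pmin(df1$Col1, df1$Col2), pmax(df1$Col1, df1$Col2)))
# FALSE  TRUE  TRUE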

Benchmarking with bigger data:

# dummy data
set.seed(123)
df <- data.frame(Col1 = sample(c("A", "B", "C"), 1000, replace = TRUE),
                 Col2 = sample(c("A", "B", "C"), 1000, replace = TRUE),
                 database = sample(c("IntAct", "Bind", "BioGrid"), 1000,
                                   replace = TRUE), stringsAsFactors = FALSE)
# benchmark
microbenchmark::microbenchmark(
  t = df[ !duplicated(t(apply(df[, 1:2], 1, sort))), ] ,
  paste = df[ !duplicated(apply(df[, 1:2], 1,
                                function(i)paste(sort(i), collapse = ","))), ],
  pmin = df[ !duplicated(cbind(pmin(df[, 1], df[, 2]), pmax(df[, 1], df[, 2]))), ],
  times = 1000)

# Unit: milliseconds
#   expr      min        lq      mean    median        uq       max neval cld
#      t 33.49008 36.337253 38.374825 37.420015 39.610627 153.89251  1000   b
#  paste 33.24177 36.102055 38.079015 37.330498 39.465803 151.43734  1000   b
#   pmin  2.59116  2.790864  3.034999  2.910316  3.137389  11.99905  1000  a 

Upvotes: 5

zx8754

Reputation: 56004

Paste the columns together with a separator, then use duplicated:

df[ !duplicated(apply(df[, 1:2], 1, function(i)paste(sort(i), collapse = ","))), ]
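
For the example data, the keys this builds all come out the same (a small sketch, assuming df is the data frame from the question):

apply(df[, 1:2], 1, function(i) paste(sort(i), collapse = ","))
# "A,B" "A,B" "A,B" -- sorting each pair first makes A/B and B/A give the same key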

Upvotes: 3

Matthew Lundberg

Reputation: 42629

d[!(duplicated(d[,1:2]) | rev(duplicated(d[rev(rownames(d)), 1:2]))),]
  Col1 Col2 database
1    A    B   IntAct

duplicated() flags rows that match a row with a lower index. That alone is not quite enough, but applying it to the data frame both top-to-bottom and bottom-to-top gives what you want.

Upvotes: 3
