Kanika Singhal
Kanika Singhal

Reputation: 655

Rows in data frame with duplicated values in certain columns

Suppose a data frame df has 5 columns a1 to a5. I want those rows of df which have common values in columns a1 and a2.

 col=c("a1","a2" )
 df1=df[,which(colnames(df)%in%col)]
 l=which(duplicated(df1)==TRUE)
 df2=df[l, ]

Now df2 contains only duplicate rows. I want to add matching rows with duplicated values.

Sample

  a1=c(1:3,3,4)
  a2=c("a1","a2","a2","a2","a1")
  a3=c(1:5)
  df=data.frame(a1,a2,a3)
  df
   a1 a2 a3
 1  1 a1  1
 2  2 a2  2
 3  3 a2  3
 4  3 a2  4
 5  4 a1  5
  col1=c("a1","a2" )
  df1=df[,which(colnames(df)%in%col1)]
  l=which(duplicated(df1)==TRUE)
  df2=df[l, ]
  df2
   a1 a2 a3
 4  3 a2  4

But I want df2 to be

   a1 a2 a3
 3  3 a2  3
 4  3 a2  4

One solution is to search in df.. but my df has around 10M rows.. so this solution is efficient

Upvotes: 0

Views: 94

Answers (2)

C8H10N4O2
C8H10N4O2

Reputation: 19005

You might also try this data.table solution for speed. I think using the keys might be faster, depending on your table size vs. the overhead of the keys:

library(data.table)

dt <- data.table(a1,a2,a3)

setkey(dt, a1, a2)

dups <- which(duplicated(dt))

dt[dt[dups, list(a1, a2)]]

edit: please see Arun's comment below for an expert data.table formulation

Upvotes: 0

cccmir
cccmir

Reputation: 1003

try instead of

 l=which(duplicated(df1)==TRUE)

use

l=which(duplicated(df1) | duplicated(df1,fromLast = T))

that should work for you

Upvotes: 1

Related Questions