Reputation: 655
Suppose a data frame df has 5 columns, a1 to a5. I want the rows of df that share the same values in columns a1 and a2.
col <- c("a1", "a2")
df1 <- df[, which(colnames(df) %in% col)]
l   <- which(duplicated(df1))
df2 <- df[l, ]
Now df2 contains only the duplicate rows (the second and later occurrences). I also want to include the first occurrence of each duplicated combination.
Sample
a1 <- c(1:3, 3, 4)
a2 <- c("a1", "a2", "a2", "a2", "a1")
a3 <- 1:5
df <- data.frame(a1, a2, a3)
df
a1 a2 a3
1 1 a1 1
2 2 a2 2
3 3 a2 3
4 3 a2 4
5 4 a1 5
col1 <- c("a1", "a2")
df1  <- df[, which(colnames(df) %in% col1)]
l    <- which(duplicated(df1))
df2  <- df[l, ]
df2
a1 a2 a3
4 3 a2 4
But I want df2
to be
a1 a2 a3
3 3 a2 3
4 3 a2 4
One solution is to search df for each duplicated pair, but my df has around 10M rows, so that approach is not efficient.
Upvotes: 0
Views: 94
Reputation: 19005
You might also try this data.table solution for speed. Keyed joins may be faster, depending on your table size versus the overhead of setting the keys:
library(data.table)
dt <- data.table(a1, a2, a3)
setkey(dt, a1, a2)              # key on the columns to match
dups <- which(duplicated(dt))   # with a key set, this flags later occurrences by key columns
dt[dt[dups, list(a1, a2)]]      # keyed join pulls in all rows matching those key values
Edit: please see Arun's comment below for an expert data.table formulation.
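The comment itself is not reproduced here, but a common data.table idiom for this kind of task (a sketch only, not necessarily the formulation Arun suggested) groups by the matching columns and keeps the groups with more than one row, avoiding the self-join entirely:

```r
library(data.table)

# Grouping-based sketch: keep only groups of size > 1.
dt <- data.table(a1 = c(1:3, 3, 4),
                 a2 = c("a1", "a2", "a2", "a2", "a1"),
                 a3 = 1:5)
dt[, if (.N > 1L) .SD, by = .(a1, a2)]
# keeps both rows where a1 == 3 and a2 == "a2"
```

`.N` is the group size and `.SD` the group's remaining columns, so only duplicated (a1, a2) combinations survive.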
Upvotes: 0
Reputation: 1003
Instead of
l <- which(duplicated(df1))
use
l <- which(duplicated(df1) | duplicated(df1, fromLast = TRUE))
duplicated() alone flags only the second and later occurrences; running it again with fromLast = TRUE flags the first occurrences as well, so the union keeps every row of a duplicated group. That should work for you.
Upvotes: 1
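Applied to the sample data from the question, the combined duplicated() call marks both occurrences of the repeated (a1, a2) pair:

```r
a1 <- c(1:3, 3, 4)
a2 <- c("a1", "a2", "a2", "a2", "a1")
a3 <- 1:5
df <- data.frame(a1, a2, a3)

df1 <- df[, c("a1", "a2")]
# flag later occurrences OR first occurrences of each pair
l   <- which(duplicated(df1) | duplicated(df1, fromLast = TRUE))
df2 <- df[l, ]
df2
#   a1 a2 a3
# 3  3 a2  3
# 4  3 a2  4
```

No join or second pass over df is needed, so this stays a single vectorized scan even at 10M rows.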