Reputation: 2614
My data frame looks like:
data <- data.frame(a=c(3,1,2,2,2,3),b=c(3,1,1,2,2,3))
duplicated(data)
[1] FALSE FALSE FALSE FALSE TRUE TRUE
What I want is not only the logical vector indicating which rows are duplicates, but also which original row each duplicated row corresponds to. In the example above, the fifth row is a duplicate of the fourth row of the original data frame, and the sixth row is a duplicate of the first row. So I want an index vector like:
NA NA NA NA 4 1
(NA indicates non-duplicating row).
My naive approach is:
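dupTF <- duplicated(data)
DupDat <- data[dupTF, ]
index0 <- rep(NA, nrow(DupDat))
# For each duplicated row, scan the original data from the top
# and stop at the first row that matches it.
# seq_len() avoids iterating over 1:0 when there are no duplicates.
for (i in seq_len(nrow(DupDat))) {
  for (j in seq_len(nrow(data))) {
    if (all(data[j, ] == DupDat[i, ])) break
  }
  index0[i] <- j
}
# Put the matched row numbers back at the positions of the duplicates.
index <- rep(NA, length(dupTF))
index[dupTF] <- index0
index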
[1] NA NA NA NA 4 1
However, this approach is not ideal because it loops over all of the data for every duplicated row...
Upvotes: 0
Views: 179
Reputation: 162311
I'd likely use data.table, since its .I and .N variables (available from within each by group) make this so straightforward:
library(data.table)
dt <- data.table(data)
dt[, XX:=c(NA, rep(.I[1], .N-1)), by=c("a","b")][,XX]
# [1] NA NA NA NA 4 1
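If adding a package is not an option, the same idea (record the first row number of each group) can probably be sketched in base R with match(), which returns the position of the first match. This is only a sketch, not part of the approach above: the `key` variable and the "\r" separator are assumptions made for illustration. Pasting columns together converts numbers to text, so it may not be safe for columns that contain the separator or for doubles that differ only beyond the printed precision.

```r
data <- data.frame(a = c(3, 1, 2, 2, 2, 3), b = c(3, 1, 1, 2, 2, 3))

# Build one string key per row (assumes "\r" never occurs in the data).
key <- do.call(paste, c(data, sep = "\r"))

# match(key, key) gives, for every row, the index of its first occurrence.
index <- match(key, key)

# Rows that are themselves the first occurrence are not duplicates.
index[!duplicated(key)] <- NA
index
# [1] NA NA NA NA  4  1
```

Because match() and duplicated() are both vectorised, this avoids the explicit double loop in the question.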
Upvotes: 2