Find the duplicated rows of a data frame and the original row each duplicate corresponds to in R

Question

My data frame looks like:

 data <- data.frame(a=c(3,1,2,2,2,3),b=c(3,1,1,2,2,3))

 duplicated(data)

 [1] FALSE FALSE FALSE FALSE  TRUE  TRUE

What I want is not only the logical vector indicating which rows are duplicates, but also which original row each duplicate corresponds to. In the example above, the fifth row is a duplicate of the fourth row and the sixth row is a duplicate of the first row. So I want an index vector like:

   NA NA NA NA 4 1

(NA indicates a non-duplicated row).

My naive approach is:

  dupTF <- duplicated(data)
  DupDat <- data[dupTF, ]
  index0 <- rep(NA, nrow(DupDat))
  # For each duplicated row, scan the data for the first identical row
  for (i in seq_len(nrow(DupDat)))
  {
    for (j in seq_len(nrow(data)))
    {
      if (all(data[j, ] == DupDat[i, ])) break
    }
    index0[i] <- j
  }
  index <- rep(NA, length(dupTF))
  index[dupTF] <- index0
  index
  # [1] NA NA NA NA  4  1

However, this approach is not ideal because it loops over all the data...

Josh O'Brien · Accepted Answer

I'd likely use data.table, since its .I and .N variables (available from within each by group) make this so straightforward:

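library(data.table)
dt <- data.table(data)
# Within each (a, b) group, .I holds the row numbers and .N the group size:
# the first row gets NA, every later row gets the first row's number
dt[, XX:=c(NA, rep(.I[1], .N-1)), by=c("a","b")][,XX]
# [1] NA NA NA NA  4  1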
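
If you'd rather avoid the extra package, something along these lines might also work in base R. This is only a sketch and not part of the data.table approach above: it pastes each row into a single key string and uses match() to find the first row with the same key. The names key, first and index are made up for illustration, and it assumes the column values convert to strings unambiguously (the "\r" separator is an arbitrary choice that should not occur in the data).

```r
data <- data.frame(a = c(3, 1, 2, 2, 2, 3), b = c(3, 1, 1, 2, 2, 3))

# Build one key string per row (assumes "\r" never appears in the values)
key <- do.call(paste, c(data, sep = "\r"))

# match() gives, for every row, the position of the first row with the same key
first <- match(key, key)

# Keep that position only for rows flagged as duplicates; NA otherwise
index <- ifelse(duplicated(data), first, NA_integer_)
index
# [1] NA NA NA NA  4  1
```

Both match() and duplicated() are vectorised, so this avoids the explicit double loop from the question.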
