FairyOnIce

Reputation: 2614

Find the duplicated rows of a data frame and the original row each duplicate corresponds to in R

My data frame looks like:

 data <- data.frame(a=c(3,1,2,2,2,3),b=c(3,1,1,2,2,3))

 duplicated(data)

 [1] FALSE FALSE FALSE FALSE  TRUE  TRUE

What I want is not only the logical vector indicating which rows are duplicates, but also which original row each duplicated row corresponds to. In the example above, the fifth row is a duplicate of the fourth row of the original data frame, and the sixth row is a duplicate of the first row. So I want an index vector like:

   NA NA NA NA 4 1

(NA indicates a row that is not a duplicate.)

My naive approach is:

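  dupTF <- duplicated(data)
  DupDat <- data[dupTF,]
  index0 <- rep(NA,nrow(DupDat))
  for (i in seq_len(nrow(DupDat)))
  {
     # scan from the top and stop at the first row equal to this duplicate
     for (j in seq_len(nrow(data)))
        {
          if(all(data[j,] == DupDat[i,])) break
        }
     index0[i] <- j
  }
  # put the matched row numbers back at the positions of the duplicates
  index <- rep(NA,length(dupTF))
  index[dupTF]<- index0
  index
  [1] NA NA NA NA  4  1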

However, this approach is not ideal because, for every duplicated row, it loops over the rows of the whole data frame...

Upvotes: 0

Views: 179

Answers (1)

Josh O'Brien

Reputation: 162311

I'd likely use data.table, since its .I and .N variables (available from within each by group) make this so straightforward:

library(data.table)
dt <- data.table(data)
dt[, XX:=c(NA, rep(.I[1], .N-1)), by=c("a","b")][,XX]
# [1] NA NA NA NA  4  1
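
If adding a package is not an option, the same idea can probably be sketched in base R. This is only an illustration, not taken from the answer above: it assumes each row can be collapsed into a single string key with paste(), that the separator "\r" never occurs in the data, and that the default numeric-to-string conversion does not merge distinct values. The names key and first are made up for the example.

```r
data <- data.frame(a = c(3, 1, 2, 2, 2, 3), b = c(3, 1, 1, 2, 2, 3))

# Collapse each row into one string key (assumes "\r" never appears in the data)
key <- do.call(paste, c(data, sep = "\r"))

# match() returns, for every row, the position of the first row with the same key
first <- match(key, key)

# Keep that position only for rows flagged as duplicates; all other rows get NA
index <- ifelse(duplicated(key), first, NA_integer_)
index
# [1] NA NA NA NA  4  1
```

Both match() and duplicated() are vectorised, so this avoids the nested loops from the question.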

Upvotes: 2
