How to optimize my R code to eliminate NEIGHBORING duplicates row-wise in a dataframe using vectorization instead of looping

Question

Edited question:

My dataframe looks like this.

x1 <- c("a", "c", "f", "j")
x2 <- c("b", "c", "g", "k")
x3 <- c("b", "d", "h", NA)
x4 <- c("a", "e", "i", NA)
df <- data.frame(x1, x2, x3, x4, stringsAsFactors=F)

df

  x1 x2   x3   x4
1  a  b    b    a
2  c  c    d    e
3  f  g    h    i
4  j  k <NA> <NA>

I wrote a loop to eliminate NEIGHBORING duplicate values in each row.

for ( i in 1:4 ) {

   for ( j in 1:3 ) {

     if ( df[i, 4-j+1] == df[i, 4-j] & is.na(df[i, 4-j+1]) == F ) {

       df[i, 4-j+1] <- NA

     } else { 

       df[i, 4-j+1] <- df[i, 4-j+1]
     }
   }
}

The result looks like this.

  x1   x2   x3   x4
1  a    b <NA>    a
2  c <NA>    d    e
3  f    g    h    i
4  j    k <NA> <NA>

However, the original dataframe is quite big so the loop doesn't seem to be an appropriate approach.

Could you please show me how to optimize it?

Thank you very much for your help and sorry for not asking more precisely.

Rami

Cath · Accepted Answer

To remove duplicates wherever they occur in a row:

df[t(apply(df,1,duplicated))]<-NA
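As a quick check of what this first variant does (a sketch on the sample data from the question; the names `dup_any` and `df_any` are only illustrative): `duplicated()` flags every repeat of an earlier value in the row, not just adjacent ones, so the trailing "a" in row 1 is blanked as well.

```r
# Rebuild the sample data from the question
x1 <- c("a", "c", "f", "j")
x2 <- c("b", "c", "g", "k")
x3 <- c("b", "d", "h", NA)
x4 <- c("a", "e", "i", NA)
df <- data.frame(x1, x2, x3, x4, stringsAsFactors = FALSE)

# One row of flags per data frame row: TRUE where a value already
# appeared earlier in the same row (adjacent or not)
dup_any <- t(apply(df, 1, duplicated))

df_any <- df
df_any[dup_any] <- NA

# Row 1 was a, b, b, a: both the second "b" and the second "a" become NA
df_any[1, ]
```

This is why the second variant is needed when only neighbouring repeats should go.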

To remove only neighbouring duplicates, this should work:

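df[] <- t(apply(df, 1, function(rg) {
    # a row without any repeated value cannot contain neighbouring duplicates
    if (any(duplicated(rg))) {
        # TRUE where a value equals its left-hand neighbour;
        # the first column has no left neighbour, hence the leading FALSE
        inddupl <- c(FALSE, rg[2:length(rg)] == rg[1:(length(rg) - 1)])
        rg[inddupl] <- NA
    }
    return(rg)
}))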
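The `apply()` call still runs an R function once per row. If that is too slow on a large data frame, one possible fully vectorised variant (a sketch, not part of the original answer) compares the whole matrix with a copy shifted by one column. It assumes every column is character, so that `as.matrix()` loses nothing; the names `m` and `dup` are illustrative.

```r
# Rebuild the sample data from the question
x1 <- c("a", "c", "f", "j")
x2 <- c("b", "c", "g", "k")
x3 <- c("b", "d", "h", NA)
x4 <- c("a", "e", "i", NA)
df <- data.frame(x1, x2, x3, x4, stringsAsFactors = FALSE)

m <- as.matrix(df)

# Compare columns 2..n with columns 1..(n-1) in a single step.
# The first column has no left neighbour, so it is never flagged.
dup <- cbind(FALSE, m[, -1, drop = FALSE] == m[, -ncol(m), drop = FALSE])

# which() keeps only the TRUE cells and skips the NA comparisons
m[which(dup)] <- NA

# Write the values back while keeping df a data frame
df[] <- m
df
```

Each cell is compared with the original value to its left, which matches the right-to-left loop in the question.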
