Rami Al-Fahham

Reputation: 627

How to optimize my R code to eliminate NEIGHBORING duplicates row-wise in a dataframe using vectorization instead of looping

Edited question:

My dataframe looks like this.

x1 <- c("a", "c", "f", "j")
x2 <- c("b", "c", "g", "k")
x3 <- c("b", "d", "h", NA)
x4 <- c("a", "e", "i", NA)
df <- data.frame(x1, x2, x3, x4, stringsAsFactors = FALSE)

df

  x1 x2   x3   x4
1  a  b    b    a
2  c  c    d    e
3  f  g    h    i
4  j  k <NA> <NA>

I wrote a loop to eliminate NEIGHBORING duplicate values in each row.

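for ( i in 1:4 ) {

   for ( j in 1:3 ) {

     # Walk from the last column towards the first and compare each cell
     # with its left-hand neighbour. isTRUE() treats a comparison that
     # involves NA as "not equal", so the if() never receives NA.
     if ( isTRUE(df[i, 4-j+1] == df[i, 4-j]) ) {

       df[i, 4-j+1] <- NA
     }
   }
}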

The result looks like this.

  x1   x2   x3   x4
1  a    b <NA>    a
2  c <NA>    d    e
3  f    g    h    i
4  j    k <NA> <NA>

However, the original dataframe is quite big so the loop doesn't seem to be an appropriate approach.

Could you please show me how to optimize?

Thank you very much for your help and sorry for not asking more precisely.

Rami

Upvotes: 1

Views: 101

Answers (1)

Cath

Reputation: 24074

To remove duplicates wherever they occur in a row:

df[t(apply(df,1,duplicated))]<-NA
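
As a quick check of what that one-liner does to the sample data from the question, here is a minimal sketch. The name df_any is only illustrative and not part of the original answer. Because duplicated() flags every repeat within a row, the trailing "a" in row 1 is blanked as well, which is more than the question asks for.

```r
# Sample data from the question
x1 <- c("a", "c", "f", "j")
x2 <- c("b", "c", "g", "k")
x3 <- c("b", "d", "h", NA)
x4 <- c("a", "e", "i", NA)
df <- data.frame(x1, x2, x3, x4, stringsAsFactors = FALSE)

# df_any is a hypothetical copy, so df itself stays untouched
df_any <- df

# apply() returns one column per row of df, hence the t()
df_any[t(apply(df_any, 1, duplicated))] <- NA

print(df_any)
# Row 1 becomes a, b, NA, NA: the last "a" is removed although
# it is not next to the first "a"
```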

To remove only neighbouring duplicates, this should work:

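df[] <- t(apply(df, 1, function(rg) {
    # Only rows that contain some repeated value need further work
    if (any(duplicated(rg))) {
        # TRUE where a value equals its left-hand neighbour;
        # the first column has no left-hand neighbour, hence the leading FALSE
        inddupl <- c(FALSE, rg[2:length(rg)] == rg[1:(length(rg) - 1)])
        # NA entries in inddupl (comparisons involving NA) select nothing here
        rg[inddupl] <- NA
    }
    return(rg)
}))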
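
If apply() is still too slow on a large dataframe, the comparison can also be done on the whole matrix at once. The following is only a sketch under the assumption that all columns are character, as in the question. The names m, same_as_left and mask are illustrative and do not come from the answer above.

```r
# Sample data from the question
x1 <- c("a", "c", "f", "j")
x2 <- c("b", "c", "g", "k")
x3 <- c("b", "d", "h", NA)
x4 <- c("a", "e", "i", NA)
df <- data.frame(x1, x2, x3, x4, stringsAsFactors = FALSE)

# Work on a character matrix so that whole columns can be compared in one step
m <- as.matrix(df)
n <- ncol(m)

# Columns 2..n compared with columns 1..(n-1), i.e. each cell with its left neighbour
same_as_left <- m[, -1, drop = FALSE] == m[, -n, drop = FALSE]

# The first column has no left neighbour, so it is never flagged
mask <- cbind(FALSE, same_as_left)

# Comparisons involving NA yield NA; treat them as "not a duplicate"
mask[is.na(mask)] <- FALSE

m[mask] <- NA

# Write the values back while keeping df a data.frame
df[] <- m

print(df)
```

On the sample data this gives the same result as the loop in the question. All comparisons use the original values, so a run such as "a", "a", "a" keeps only its first element.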

Upvotes: 4
