Reputation: 627
Edited question:
My dataframe looks like this.
x1 <- c("a", "c", "f", "j")
x2 <- c("b", "c", "g", "k")
x3 <- c("b", "d", "h", NA)
x4 <- c("a", "e", "i", NA)
df <- data.frame(x1, x2, x3, x4, stringsAsFactors=F)
df
  x1 x2   x3   x4
1  a  b    b    a
2  c  c    d    e
3  f  g    h    i
4  j  k <NA> <NA>
I wrote a loop to eliminate NEIGHBORING duplicate values in each row.
for ( i in 1:4 ) {
  for ( j in 1:3 ) {
    if ( df[i, 4-j+1] == df[i, 4-j] & is.na(df[i, 4-j+1]) == F ) {
      df[i, 4-j+1] <- NA
    } else {
      df[i, 4-j+1] <- df[i, 4-j+1]
    }
  }
}
The result looks like this.
  x1   x2   x3   x4
1  a    b <NA>    a
2  c <NA>    d    e
3  f    g    h    i
4  j    k <NA> <NA>
However, the original dataframe is quite big so the loop doesn't seem to be an appropriate approach.
Could you please show me how to optimize it?
Thank you very much for your help and sorry for not asking more precisely.
Rami
Upvotes: 1
Views: 101
Reputation: 24074
To remove duplicates wherever they occur within a row:
df[t(apply(df,1,duplicated))]<-NA
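As a quick check of what that one-liner does, here is a small sketch run on the sample data from the question (the names `dup` and `out` are only for illustration). Note that it also blanks the last "a" in row 1, which is not a neighbouring duplicate, so it does not reproduce the result the question asks for.

```r
# Sample data from the question
df <- data.frame(
  x1 = c("a", "c", "f", "j"),
  x2 = c("b", "c", "g", "k"),
  x3 = c("b", "d", "h", NA),
  x4 = c("a", "e", "i", NA),
  stringsAsFactors = FALSE
)

# duplicated() is applied to each row; apply() returns the per-row results
# as columns, so t() turns them back into the shape of df
dup <- t(apply(df, 1, duplicated))

out <- df
out[dup] <- NA   # every repeat of an earlier value in the same row becomes NA
print(out)
```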
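To remove only neighbouring duplicates, this should work:
df[] <- t(apply(df, 1, function(rg) {
  # only rows that contain a repeated value need any work
  if (any(duplicated(rg))) {
    # TRUE where a value equals its left-hand neighbour; the first value never counts
    inddupl <- c(FALSE, rg[2:length(rg)] == rg[1:(length(rg) - 1)])
    rg[inddupl] <- NA
  }
  return(rg)
}))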
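Since the question is about speed on a big data frame, a possible alternative is to drop the per-row function call and compare whole columns at once. This is only a sketch, not benchmarked, and it assumes all columns are character so that `as.matrix()` does not change the values; the names `m`, `same_as_left` and `res` are made up for the example.

```r
# Sample data from the question
df <- data.frame(
  x1 = c("a", "c", "f", "j"),
  x2 = c("b", "c", "g", "k"),
  x3 = c("b", "d", "h", NA),
  x4 = c("a", "e", "i", NA),
  stringsAsFactors = FALSE
)

m <- as.matrix(df)
n <- ncol(m)

# Compare columns 2..n with columns 1..(n-1) in one step.
# The first column has no left-hand neighbour, so it is padded with FALSE.
same_as_left <- cbind(FALSE, m[, -1, drop = FALSE] == m[, -n, drop = FALSE])

# which() ignores the NA comparisons, so cells that are already NA stay untouched
m[which(same_as_left)] <- NA

res <- df
res[] <- m   # write the values back while keeping the data frame structure
print(res)
```

On the sample data this should give the same result as the loop in the question: only `x3` in row 1 and `x2` in row 2 are set to NA.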
Upvotes: 4