Removing/collapsing duplicate rows in R

Question

I am using the following R code, which I copied from elsewhere (https://support.bioconductor.org/p/70133/). It seems to work well for what I want to do (remove/collapse duplicates from a dataset), but I do not understand the last line. I would like to know on what basis the duplicates are removed/collapsed. A comment there said it was based on the median absolute deviation (MAD), but I do not follow that. Could anyone help me understand this, please?

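 # Simulated data: 200 probesets mapped to 26 possible gene names
 Probesets=paste("a",1:200,sep="")
 Genes=sample(letters,200,replace=TRUE)
 Value=rnorm(200)
 X=data.frame(Probesets,Genes,Value)
 # Sort rows by Value, largest first
 X=X[order(X$Value,decreasing=TRUE),]
 # Keep only the first row seen for each gene
 Y=X[which(!duplicated(X$Genes)),]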

Chris Ruehlemann · Accepted Answer

Are you sure you want to remove those rows where the Genes values are duplicated? That, at least, is what this code does:

Y=X[which(!duplicated(X$Genes)),]

Thus, Y contains only unique Genes values. If you compare nrow(Y) and length(unique(X$Genes)), you will see that the result is the same:

nrow(Y); length(unique(X$Genes))
[1] 26
[1] 26
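
As for which row survives: duplicated() flags the second and later occurrences of a value, so the first occurrence in the current row order is the one kept. Because X was sorted by Value in decreasing order beforehand, the first row for each gene is the one with the largest Value. Nothing in the posted code computes a median absolute deviation; the original thread may have sorted by a per-probeset MAD rather than Value, but that is an assumption. A minimal sketch that checks this behaviour (the seed and the name max_per_gene are only for illustration):

```r
set.seed(1)  # arbitrary seed, only so the example is reproducible
Probesets <- paste("a", 1:200, sep = "")
Genes <- sample(letters, 200, replace = TRUE)
Value <- rnorm(200)
X <- data.frame(Probesets, Genes, Value)

# Sort the whole table so the largest Value comes first
X <- X[order(X$Value, decreasing = TRUE), ]
# Keep the first occurrence of each gene, i.e. its highest-Value row
Y <- X[!duplicated(X$Genes), ]

# Independent check: the maximum Value per gene, computed directly
max_per_gene <- tapply(X$Value, X$Genes, max)
all(Y$Value == max_per_gene[as.character(Y$Genes)])
# [1] TRUE
```

So the "collapse" is simply "keep the top-ranked row per gene", and the ranking is whatever column was passed to order().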

If you want to remove rows that contain duplicate values across all columns, which is arguably the definition of a duplicate row, then you can do this:

Y=X[!duplicated(X),]

To see how it works, consider this example:

df <- data.frame(
  a = c(1,1,2,3),
  b = c(1,1,3,4)
)
df
  a b
1 1 1
2 1 1
3 2 3
4 3 4

df[!duplicated(df),]
  a b
1 1 1
3 2 3
4 3 4
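
It can help to look at the logical vector that duplicated() returns before it is negated and used as a row index. The sketch below reuses the same df; the fromLast argument is part of base R's duplicated() and is shown here only as an optional variation, not something the original code uses:

```r
df <- data.frame(
  a = c(1, 1, 2, 3),
  b = c(1, 1, 3, 4)
)

# TRUE marks a row that repeats an earlier row
duplicated(df)
# [1] FALSE  TRUE FALSE FALSE

# fromLast = TRUE scans from the bottom, so the last copy is kept instead
df[!duplicated(df, fromLast = TRUE), ]
#   a b
# 2 1 1
# 3 2 3
# 4 3 4
```

The surviving row names (1 versus 2) show which copy was kept in each case.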
