Sylvia Rodriguez

Reputation: 1353

Removing/collapsing duplicate rows in R

I am using the following R code, which I copied from elsewhere (https://support.bioconductor.org/p/70133/). It seems to work well for what I want to do (remove/collapse duplicates from a dataset), but I do not understand the last line. I would like to know on what basis the duplicates are removed or collapsed. It was commented that the selection is based on the median absolute deviation (MAD), but I do not follow that. Could anyone help me understand this, please?

 Probesets=paste("a",1:200,sep="")
 Genes=sample(letters,200,replace=TRUE)
 Value=rnorm(200)
 X=data.frame(Probesets,Genes,Value)
 X=X[order(X$Value,decreasing=TRUE),]
 Y=X[which(!duplicated(X$Genes)),]

Upvotes: 0

Views: 685

Answers (2)

hello_friend

Reputation: 5788

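Your code keeps, for each gene, the row with the maximum Value. The rows are first sorted by Value in decreasing order, and duplicated(X$Genes) then flags every occurrence of a gene except the first one. Negating it therefore keeps only the first, i.e. highest-valued, row per gene. Nothing in this snippet computes a MAD; the ranking uses whatever is stored in Value.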
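A quick way to convince yourself is to compute the per-gene maximum independently and compare it with what the last line keeps. This is only a sketch: the seed is arbitrary and `max_per_gene` is a name introduced here for illustration.

```r
set.seed(1)  # arbitrary seed, used only to make the illustration reproducible

X <- data.frame(
  Probesets = paste0("a", 1:200),
  Genes     = sample(letters, 200, replace = TRUE),
  Value     = rnorm(200)
)

# Sort so the largest Value comes first, then keep the first row seen per gene.
X <- X[order(X$Value, decreasing = TRUE), ]
Y <- X[!duplicated(X$Genes), ]

# Independent check: the maximum Value of each gene, computed directly.
max_per_gene <- tapply(X$Value, X$Genes, max)

# TRUE when every kept row holds the maximum Value of its gene.
all(Y$Value == max_per_gene[as.character(Y$Genes)])
```

If Value held a per-probeset MAD instead of random numbers, the same two lines would keep the probeset with the largest MAD for each gene.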

Upvotes: 1

Chris Ruehlemann

Reputation: 21400

Are you sure you want to remove those rows where the Genes values are duplicated? That's at least what this code does:

Y=X[which(!duplicated(X$Genes)),]

Thus, Y contains only unique Genes values. If you compare nrow(Y) and length(unique(X$Genes)), you will see that the result is the same:

nrow(Y); length(unique(X$Genes))
[1] 26
[1] 26

If you want to remove rows that contain duplicate values across all columns, which is arguably the definition of a duplicate row, then you can do this:

Y=X[!duplicated(X),]

To see how it works consider this example:

df <- data.frame(
  a = c(1,1,2,3),
  b = c(1,1,3,4)
)
df
  a b
1 1 1
2 1 1
3 2 3
4 3 4

df[!duplicated(df),]
  a b
1 1 1
3 2 3
4 3 4
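It may also help to look at the logical vector that duplicated() returns, since that is what drives the subsetting. The sketch below reuses the same df and additionally shows the fromLast argument, which keeps the last copy of each duplicate instead of the first.

```r
df <- data.frame(
  a = c(1, 1, 2, 3),
  b = c(1, 1, 3, 4)
)

# duplicated() marks a row TRUE only if an identical row appeared earlier,
# so the first copy is never flagged.
duplicated(df)
# FALSE TRUE FALSE FALSE

# With fromLast = TRUE the scan runs from the bottom up, so row 1 is flagged
# and rows 2, 3 and 4 are kept.
df[!duplicated(df, fromLast = TRUE), ]
```

The same argument works on a single column, e.g. duplicated(X$Genes, fromLast = TRUE), which matters when the row order encodes a ranking as in the question.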

Upvotes: 1
