Reputation: 437
I have a data frame of correlation coefficients like the one below. It contains the coefficient for both a*b and b*a, which are the same. How do I remove these duplicates? Can anyone please help?
**Var1, Var2, r**
ApoA1.ng.ml.1, Apo.B.ng.ml, 0.9998438
Apo.B.ng.ml, ApoA1.ng.ml.1, 0.9998438
SLM.T0., TBW.T0., 0.9992563
TBW.T0., SLM.T0., 0.9992563
Insulin.mercdiaConc..U.L, Insulin..pg.ml, 0.9313702
Insulin..pg.ml, Insulin.mercdiaConc..U.L, 0.9313702
Upvotes: 0
Views: 69
Reputation: 160407
If the other techniques don't quite work, you can build temporary min/max strings and de-duplicate on those:
x <- read.csv(stringsAsFactors=FALSE, text="
Var1,Var2,r
ApoA1.ng.ml.1,Apo.B.ng.ml,0.9998438
Apo.B.ng.ml,ApoA1.ng.ml.1,0.9998438
SLM.T0.,TBW.T0.,0.9992563
TBW.T0.,SLM.T0.,0.9992563
Insulin.mercdiaConc..U.L,Insulin..pg.ml,0.9313702
Insulin..pg.ml,Insulin.mercdiaConc..U.L,0.9313702")
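# duplicated() checks a single object, so combine both keys into one data frame;
# a second positional argument would be taken as `incomparables` instead.
x[!duplicated(data.frame(pmin(x$Var1, x$Var2), pmax(x$Var1, x$Var2))),]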
# Var1 Var2 r
# 1 ApoA1.ng.ml.1 Apo.B.ng.ml 0.9998438
# 3 SLM.T0. TBW.T0. 0.9992563
# 5 Insulin.mercdiaConc..U.L Insulin..pg.ml 0.9313702
(You can also assign them temporarily to columns in the frame, for example
x$m1 <- pmin(x$Var1, x$Var2)
x$m2 <- pmax(x$Var1, x$Var2)
x[!duplicated(x[c("m1","m2")]),]
though you then have to remove the temp variables yourself.)
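As a rough sketch, the same min/max idea can be wrapped in a small function so that no temporary columns are left in the frame. The name `dedupe_pairs` and its arguments `a` and `b` are hypothetical and not part of the answer above; the sample data is a shortened copy of the question's rows.

```r
# Hypothetical helper (illustration only): keep the first row seen for each
# unordered pair of names, regardless of which column each name is in.
dedupe_pairs <- function(df, a = "Var1", b = "Var2") {
  va <- as.character(df[[a]])  # guard against factor columns
  vb <- as.character(df[[b]])
  # Canonical key: the smaller string first, the larger string second
  key <- data.frame(lo = pmin(va, vb), hi = pmax(va, vb),
                    stringsAsFactors = FALSE)
  df[!duplicated(key), , drop = FALSE]
}

x <- data.frame(
  Var1 = c("ApoA1.ng.ml.1", "Apo.B.ng.ml", "SLM.T0.", "TBW.T0."),
  Var2 = c("Apo.B.ng.ml", "ApoA1.ng.ml.1", "TBW.T0.", "SLM.T0."),
  r    = c(0.9998438, 0.9998438, 0.9992563, 0.9992563),
  stringsAsFactors = FALSE
)

res <- dedupe_pairs(x)  # keeps one row per unordered pair
```

Because the key lives only inside the function, the returned frame has the original columns and nothing to clean up afterwards.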
Upvotes: 2
Reputation: 520888
We could try using the sqldf package here (the query assumes the data frame is named df):
library(sqldf)
sql <- "SELECT MIN(Var1, Var2), MAX(Var2, Var1), MAX(r) AS R
FROM df
GROUP BY MIN(Var1, Var2), MAX(Var2, Var1)"
df_out <- sqldf(sql)
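If sqldf is not available, the grouping in that query can be approximated in base R. This is only a sketch of the same idea using `aggregate()`, not the method from this answer; the column names `m1` and `m2` and the shortened sample frame are assumptions made for illustration.

```r
# Sample data: a shortened copy of the question's rows, named df as in the query
df <- data.frame(
  Var1 = c("ApoA1.ng.ml.1", "Apo.B.ng.ml", "SLM.T0.", "TBW.T0."),
  Var2 = c("Apo.B.ng.ml", "ApoA1.ng.ml.1", "TBW.T0.", "SLM.T0."),
  r    = c(0.9998438, 0.9998438, 0.9992563, 0.9992563),
  stringsAsFactors = FALSE
)

# Order each pair so that a*b and b*a end up with the same (m1, m2) key
df$m1 <- pmin(df$Var1, df$Var2)
df$m2 <- pmax(df$Var1, df$Var2)

# One row per unordered pair, keeping the largest r within each group
out <- aggregate(r ~ m1 + m2, data = df, FUN = max)
```

Since a*b and b*a carry the same coefficient, taking the maximum within each group simply picks that shared value.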
Upvotes: 2