user2814482
user2814482

Reputation: 651

How to find duplicate ids and compare rows

I have a very large dataframe where pairs is the id, and each pair has a correlation. There are duplicate rows where the same pair might have a different correlation.

I want to plot corr1 versus corr2 for the same pair. So I thought I might have to make a new dataframe frist them plot corr1 verus corr2.

For example.

df1

  pair rev_pair corr
1  A:B      B:A  0.5
2  B:C      C:B  0.9
3  C:D      D:C  0.5
4  E:F      F:E  1.0
5  A:B      B:A  0.6
6  B:C      C:B  1.0

df_pairs

pair  corr1  corr2
A:B    0.5    0.6
B:C    0.9    1.0

Here is my attempted script. But it is taking forever. Does anyone have any suggestions.

Thank you,

i<-1                    
max_dup_pairs<-(length(Corr_dL[[i]]$corr)+length(Corr_sLdP[[i]]$corr)+length(Corr_sLsP[[i]]$corr))    
dup_pairs<-matrix(nrow=max_dup_pairs,ncol=4)
counter<-1
for (j in 1:dim(ScoreCorr_list[[i]])[1]){ # for each row
for (k in (j+1):dim(ScoreCorr_list[[i]])[1]){ # compare to rows below
    if (ScoreCorr_list[[i]][j,]$pair==ScoreCorr_list[[i]][k,]$pair){
        dup_pairs[i,1]<-ScoreCorr_list[[i]][j,]$corr
        dup_pairs[i,2]<-ScoreCorr_list[[i]][k,]$corr
        dup_pairs[i,3]<-ScoreCorr_list[[i]][j,]$pair
        dup_pairs[i,4]<-ScoreCorr_list[[i]][k,]$pair

    }   
 }  

}
ind <- apply(dup_pairs, 1, function(x) all(is.na(x)))
dup_pairs <- dup_pairs[ !ind, ]
pdf("Corr.duplicatePairsCorr.xy.pdf",width=6.07,height=3.75)
plot(dup_pairs[,1],dup_pairs[,2])
dev.off()

Upvotes: 0

Views: 75

Answers (2)

akrun
akrun

Reputation: 887058

If you need to reshape df1 to df_pairs

  library(reshape2)
  df1$indx <- with(df1, paste0("corr", ave(seq_along(pair), pair, FUN=seq_along)) )
  df_pairs <- dcast(df1, pair~indx, value.var="corr")
   df_pairs[!rowSums(is.na(df_pairs[,-1])),]
  #  pair corr1 corr2
  #1  A:B   0.5   0.6
  #2  B:C   0.9   1.0

You may also use data.table

  library(data.table)
  key1 <- c("pair", "rev_pair")
  DT <- data.table(df1, key=key1)
   dcast(DT[DT[duplicated(DT)],][,N:= paste0("corr",1:.N), by= key1], pair~N, value.var="corr")
  #  pair corr1 corr2
  #1  A:B   0.5   0.6
  #2  B:C   0.9   1.0

Upvotes: 1

agstudy
agstudy

Reputation: 121568

You can use aggregate after removing duplicated values.

dat_dup <- 
dat[duplicated(dat[,1:2]) |  duplicated(dat[,1:2],fromLast = TRUE) ,]
aggregate(corr~pair,data=dat_dup,FUN=I)

   pair corr.1 corr.2
1  A:B    0.5    0.6
2  B:C    0.9    1.0

Upvotes: 1

Related Questions