Pairwise Comparison of Rows in R

Question

I have a dataset that contains results for many tests across many samples. The samples are replicated within the dataset, and I would like to compare the test results between replicates within each group of replicated samples. I thought it might be easiest to first split my data frame by SampleID so that I have a list of data frames, one per SampleID. A sample can have 2, 3, 4, or even 5 replicates, so the number of unique row pairs to compare differs between sample groups.

I have laid out the logic I have in mind below. I want to run a function on the list of data frames and output the match results. The function would compare every unique pair of rows within each group of replicated samples and return "Match", "Mismatch", or NA (if one or both values for a test are missing) for each test. It would also return the number of tests that overlap between the two compared replicates, the number of matches, and the number of mismatches. Lastly, it would include a column where the sample names are pasted together with their row numbers so I know which two samples were compared (e.g. Sample1.1_Sample1.2). Could anyone point me in the right direction?

    #Input data structure
    data = as.data.frame(cbind(rbind("Sample1","Sample1","Sample2","Sample2","Sample2"),rbind("A","A","C","C","C"), rbind("A","T","C","C","C"), 
                 rbind("A",NA,"C","C","C"), rbind("A","A","C","C","C"), rbind("A","T","C","C",NA), rbind("A","A","C","C","C"),
                 rbind("A","A","C","C","C"), rbind("A",NA,"C","T","T"), rbind("A","A","C","C","C"), rbind("A","A","C","C","C")))

    colnames(data) = c("SampleID", "Test1","Test2","Test3","Test4","Test5","Test6","Test7","Test8","Test9","Test10")
    data 

    data.split = split(data, data$SampleID)


    ##Row comparison function
    #Input is a list of data frames. Each data frame contains results for replicates of the same sample.
    RowCompare = function(x){
      rowcount = nrow(x)
      ##ifelse(rowcount==2,
        ##compare row 1 to row 2
          ##paste sample names being compared together
          ##how many non-NA values overlap, keep value
          ##of those that overlap, how many match, keep value
          ##of those that overlap, how many do not match, keep value
      #ifelse(rowcount==3,
          ##compare row 1 to row 2
            ##paste sample names being compared together
            ##how many non-NA values overlap, keep value
            ##of those that overlap, how many match, keep value
            ##of those that overlap, how many do not match, keep value
          ##compare row 1 to row 3
            ##paste sample names being compared together
            ##how many non-NA values overlap, keep value
            ##of those that overlap, how many match, keep value
            ##of those that overlap, how many do not match, keep value
          ##compare row 2 to row 3
            ##paste sample names being compared together
            ##how many non-NA values overlap, keep value
            ##of those that overlap, how many match, keep value
            ##of those that overlap, how many do not match, keep value
      return(results)
    }

    #Output is a list of data frames - one for sample name
    out = lapply(names(data.split), function(x) RowCompare(data.split[[x]])) 

    #Row bind the list of data frames back together to one large data frame
    out.merge = do.call(rbind.data.frame, out) 
    head(out.merge)

    #Desired output
    out.merge = as.data.frame(cbind(rbind("Sample1.1_Sample1.2","Sample2.1_Sample2.2","Sample2.1_Sample2.3","Sample2.2_Sample2.3"),rbind("Match","Match","Match","Match"), 
                      rbind("Mismatch","Match","Match","Match"), rbind(NA,"Match","Match","Match"), rbind("Match","Match","Match","Match"), rbind("Mismatch","Match",NA,NA), 
                      rbind("Match","Match","Match","Match"), rbind("Match","Match","Match","Match"), rbind(NA,"Mismatch","Mismatch","Match"), rbind("Match","Match","Match","Match"), 
                      rbind("Match","Match","Match","Match"), rbind(8,10,9,9), rbind(6,9,8,9), rbind(2,1,1,0)))

    colnames(out.merge) = c("SampleID", "Test1","Test2","Test3","Test4","Test5","Test6","Test7","Test8","Test9","Test10", "Num_Overlap", "Num_Match","Num_Mismatch")
    out.merge

One thing I did see on another post that I thought might be useful is the line below which would create a data frame of unique row combinations that could then be used to define which rows to compare in each group of replicated samples. Not sure how to implement it though.

    t(combn(nrow(data),2))

Thank you.

C8H10N4O2 · Accepted Answer

You are on the right track with t(combn(nrow(data),2)); the key is to apply it to each group's data frame rather than to the whole data set. See below for how I would do it.
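Before the full solution, here is a small sketch of what that call returns for a single group. The group size of 3 is made up for illustration; in the real data it would be nrow() of one element of data.split.

```r
# All unique pairs of row indices for a hypothetical group of 3 replicates.
# combn() returns one pair per column, so t() puts one pair per row.
pairs <- t(combn(3, 2))
pairs
#      [,1] [,2]
# [1,]    1    2
# [2,]    1    3
# [3,]    2    3

# Number of pairs for groups of 2, 3, 4 and 5 replicates: 1, 3, 6, 10
n_pairs <- sapply(2:5, function(n) choose(n, 2))
n_pairs
```

Because the pairs are generated from the group's own row count, the same code handles groups of any size without a separate branch per replicate count.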

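    # Column indices of the test columns (Test1, Test2, ...).
    # The backslash must be doubled inside an R string.
    testCols <- grep("^Test\\d+", colnames(data))

    # Compare the test results of two replicates.
    # Returns the overlap, match and mismatch counts.
    TestsCompare <- function(x, y) {
      ## how many non-NA values overlap
      overlaps <- sum(!is.na(x) & !is.na(y))
      ## of those that overlap, how many match
      matches <- sum(x == y, na.rm = TRUE)
      ## of those that overlap, how many do not match
      non_matches <- overlaps - matches  # complement of matches
      c(overlaps, matches, non_matches)
    }

    # Compare every unique pair of rows within one group of replicates.
    # Assumes the group has at least 2 rows; combn() errors otherwise.
    RowCompare <- function(x) {
      comp <- NULL
      pairs <- t(combn(nrow(x), 2))
      for (i in seq_len(nrow(pairs))) {
        row_a <- pairs[i, 1]
        row_b <- pairs[i, 2]
        a_tests <- x[row_a, testCols]
        b_tests <- x[row_b, testCols]
        comp <- rbind(comp, c(row_a, row_b, TestsCompare(a_tests, b_tests)))
      }
      colnames(comp) <- c("row_a", "row_b", "overlaps", "matches", "non_matches")
      return(comp)
    }

    out <- lapply(data.split, RowCompare)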

Produces:

    > out
    $Sample1
         row_a row_b overlaps matches non_matches
    [1,]     1     2        8       6           2

    $Sample2
         row_a row_b overlaps matches non_matches
    [1,]     1     2       10       9           1
    [2,]     1     3        9       8           1
    [3,]     2     3        9       9           0
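The output above gives the counts but not the per-test "Match"/"Mismatch" labels or the pasted sample names from the desired output. One possible way to get those is sketched below. ComparePairs and its id_col argument are hypothetical names introduced here, and the input is the question's data rebuilt with data.frame() so the block runs on its own; treat it as a starting point under those assumptions rather than a finished implementation.

```r
# The question's input data, rebuilt so this block is self-contained
data <- data.frame(
  SampleID = c("Sample1", "Sample1", "Sample2", "Sample2", "Sample2"),
  Test1  = c("A", "A", "C", "C", "C"),
  Test2  = c("A", "T", "C", "C", "C"),
  Test3  = c("A", NA,  "C", "C", "C"),
  Test4  = c("A", "A", "C", "C", "C"),
  Test5  = c("A", "T", "C", "C", NA),
  Test6  = c("A", "A", "C", "C", "C"),
  Test7  = c("A", "A", "C", "C", "C"),
  Test8  = c("A", NA,  "C", "T", "T"),
  Test9  = c("A", "A", "C", "C", "C"),
  Test10 = c("A", "A", "C", "C", "C"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: label every test for every unique pair of rows in one group
ComparePairs <- function(x, id_col = "SampleID") {
  test_cols <- grep("^Test\\d+$", colnames(x))
  tests <- as.matrix(x[, test_cols, drop = FALSE])  # character matrix of results
  if (nrow(tests) < 2) return(NULL)                 # a single replicate has no pairs
  pairs <- combn(nrow(tests), 2)                    # one column per unique pair
  rows <- lapply(seq_len(ncol(pairs)), function(k) {
    a <- pairs[1, k]
    b <- pairs[2, k]
    same <- tests[a, ] == tests[b, ]  # TRUE, FALSE, or NA if either value is missing
    labels <- matrix(ifelse(same, "Match", "Mismatch"), nrow = 1,
                     dimnames = list(NULL, colnames(tests)))
    cbind(
      data.frame(SampleID = paste0(x[[id_col]][a], ".", a, "_",
                                   x[[id_col]][b], ".", b),
                 stringsAsFactors = FALSE),
      as.data.frame(labels, stringsAsFactors = FALSE),
      data.frame(Num_Overlap  = sum(!is.na(same)),
                 Num_Match    = sum(same, na.rm = TRUE),
                 Num_Mismatch = sum(!same, na.rm = TRUE))
    )
  })
  do.call(rbind, rows)
}

# Split by sample, compare within each group, and bind the results together
out.merge <- do.call(rbind, lapply(split(data, data$SampleID), ComparePairs))
rownames(out.merge) <- NULL
out.merge
```

On the question's data this should give one row per pair (Sample1.1_Sample1.2, Sample2.1_Sample2.2, Sample2.1_Sample2.3, Sample2.2_Sample2.3) with the same overlap, match and mismatch counts as the output above. The row numbers in the labels are positions within each group, not row names of the original data frame.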
