Perform Genotyping data manipulation

Question

I have this dataframe called mydf, where The REF and ALT columns could have letters "A", "T", "G","C".

In columns A,C,G,T, A is same as T and G is same as C. Meaning whatever we have in A and T should be added together and whatever values we have in G and C should be added together. I want to match the letters in columns A,C,G,T and get the corresponding values in "REF, ALT" format as shown in expected output.

mydf <- structure(
    list(A = c(10, NA, 1, NA, 22, 22), C = c(NA, 11, 11, 46, NA, NA), 
         G = c(NA, 10, 29, 22, NA, NA), T = c(12, NA, NA, NA, 47, 47), 
         N = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), 
         `=` = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), 
         REF = c("A", "G", "G", "G", "A", "A"), ALT = c("T", "C", "T", "A", "G", "C")), 
    .Names = c("A", "C", "G", "T", "N", "=", "REF", "ALT"), 
    row.names = c("chr9:5078362-5078362.35", "chr9:5078361-5078361.36", 
                  "chr11:32439135-32439135.37", "chr12:112888210-112888210.38", 
                  "chr12:112888211-112888211.39", "chr12:112888211-112888211.40"), 
    class = "data.frame"
)

My dataframe:

mydf

                              A  C  G  T  N  = REF ALT
chr9:5078362-5078362.35      10 NA NA 12 NA NA   A   T
chr9:5078361-5078361.36      NA 11 10 NA NA NA   G   C
chr11:32439135-32439135.37    1 11 29 NA NA NA   G   T
chr12:112888210-112888210.38 NA 46 22 NA NA NA   G   A
chr12:112888211-112888211.39 22 NA NA 47 NA NA   A   G
chr12:112888211-112888211.40 22 NA NA 47 NA NA   A   C

Expected output:

10, 12
10, 11
40, 1
68, NA
69, NA
69, NA

David Arenburg · Accepted Answer

Here's not vectorized attempt. Below is the vectorized alternative

Define the similar values into same vectors

f <- c("A", "T")
s <- c("C", "G")

Define the function

foo <-  function(x){
  ## Save the "REF and "ALT" columns into temporary object in order
  ## to avoid it recalculation
  temp <- x[c("REF", "ALT")]

  ## This checks if both "REF and "ALT" are in `f` or in `c`, if so, 
  ## it just subsets mydf according to the values in x[c("REF", "ALT")]
  if(setequal(temp, f) || setequal(temp, s)){
    toString(x[temp])

  ## If not, then checks the order of appearance of either values in f or s
  ## and then sums the subsets accordingly
  } else if (temp[1L] %in% f) {    
    paste(sum(as.numeric(x[f]), na.rm = TRUE), sum(as.numeric(x[s]), na.rm = TRUE), sep = ", ")
  } else {
    paste(sum(as.numeric(x[s]), na.rm = TRUE), sum(as.numeric(x[f]), na.rm = TRUE), sep = ", ")
  }
}  

matrix(apply(mydf, 1, foo), ncol = 1)
#      [,1]    
# [1,] "10, 12"
# [2,] "10, 11"
# [3,] "40, 1" 
# [4,] "68, 0" 
# [5,] "69, 0" 
# [6,] "69, 0"

EDIT

I promised a vectorized approach, so here's an attempt

temp <- mydf[c("REF", "ALT")]

findx <- rowSums((temp == f[col(temp)]) | (temp[2:1] == f[col(temp)])) > 1
sindx <- rowSums((temp == s[col(temp)]) | (temp[2:1] == s[col(temp)])) > 1
tindx <- !(findx + sindx) & temp[, "REF"] %in% f
foindx <- !(findx + sindx + tindx)

res <- matrix(NA, nrow(mydf), 2)

res[findx, ] <- as.matrix(mydf[findx, f])
res[sindx, ] <- as.matrix(mydf[sindx, s])

temp2 <- cbind(rowSums(mydf[, f], na.rm = TRUE), rowSums(mydf[, s], na.rm = TRUE))
res[tindx, ] <- temp2[tindx, ]
res[foindx, ] <- temp2[foindx, 2:1]


#      [,1] [,2]
# [1,]   10   12
# [2,]   11   10
# [3,]   40    1
# [4,]   68    0
# [5,]   69    0
# [6,]   69    0

Perform Genotyping data manipulation

Answers (2)

Related Questions