Reputation: 5719
I have this dataframe called mydf, where The REF
and ALT
columns could have letters "A", "T", "G","C".
In columns A,C,G,T
, A is same as T and G is same as C. Meaning whatever we have in A and T should be added together and whatever values we have in G and C should be added together. I want to match the letters in columns A,C,G,T and get the corresponding values in "REF, ALT" format as shown in expected output.
mydf <- structure(
list(A = c(10, NA, 1, NA, 22, 22), C = c(NA, 11, 11, 46, NA, NA),
G = c(NA, 10, 29, 22, NA, NA), T = c(12, NA, NA, NA, 47, 47),
N = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_),
`=` = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_),
REF = c("A", "G", "G", "G", "A", "A"), ALT = c("T", "C", "T", "A", "G", "C")),
.Names = c("A", "C", "G", "T", "N", "=", "REF", "ALT"),
row.names = c("chr9:5078362-5078362.35", "chr9:5078361-5078361.36",
"chr11:32439135-32439135.37", "chr12:112888210-112888210.38",
"chr12:112888211-112888211.39", "chr12:112888211-112888211.40"),
class = "data.frame"
)
My dataframe:
mydf
A C G T N = REF ALT
chr9:5078362-5078362.35 10 NA NA 12 NA NA A T
chr9:5078361-5078361.36 NA 11 10 NA NA NA G C
chr11:32439135-32439135.37 1 11 29 NA NA NA G T
chr12:112888210-112888210.38 NA 46 22 NA NA NA G A
chr12:112888211-112888211.39 22 NA NA 47 NA NA A G
chr12:112888211-112888211.40 22 NA NA 47 NA NA A C
Expected output:
10, 12
10, 11
40, 1
68, NA
69, NA
69, NA
Upvotes: 3
Views: 83
Reputation: 31171
You can use this approach:
n = names(mydf)
f = function(r, p, q)
{
if(p+q==5) return(c(mydf[r,p], mydf[r,q]))
c(sum(as.numeric(mydf[cbind(r, c(p,5-p))]), na.rm=T),
sum(as.numeric(mydf[cbind(r, c(q,5-q))]), na.rm=T))
}
do.call(rbind,Map(f, 1:nrow(mydf), match(mydf$REF,n), match(mydf$ALT,n)))
# [,1] [,2]
#[1,] 10 12
#[2,] 10 11
#[3,] 40 1
#[4,] 68 0
#[5,] 69 0
#[6,] 69 0
This approach is using a hack: you can notice in mydf
that if ALT
and REF
contains column number 1
and 4
(sum = 5
) or 2
and 3
(sum = 5
), you just need to return the values on this respective columns.
Upvotes: 3
Reputation: 92282
Here's not vectorized attempt. Below is the vectorized alternative
Define the similar values into same vectors
f <- c("A", "T")
s <- c("C", "G")
Define the function
foo <- function(x){
## Save the "REF and "ALT" columns into temporary object in order
## to avoid it recalculation
temp <- x[c("REF", "ALT")]
## This checks if both "REF and "ALT" are in `f` or in `c`, if so,
## it just subsets mydf according to the values in x[c("REF", "ALT")]
if(setequal(temp, f) || setequal(temp, s)){
toString(x[temp])
## If not, then checks the order of appearance of either values in f or s
## and then sums the subsets accordingly
} else if (temp[1L] %in% f) {
paste(sum(as.numeric(x[f]), na.rm = TRUE), sum(as.numeric(x[s]), na.rm = TRUE), sep = ", ")
} else {
paste(sum(as.numeric(x[s]), na.rm = TRUE), sum(as.numeric(x[f]), na.rm = TRUE), sep = ", ")
}
}
matrix(apply(mydf, 1, foo), ncol = 1)
# [,1]
# [1,] "10, 12"
# [2,] "10, 11"
# [3,] "40, 1"
# [4,] "68, 0"
# [5,] "69, 0"
# [6,] "69, 0"
EDIT
I promised a vectorized approach, so here's an attempt
temp <- mydf[c("REF", "ALT")]
findx <- rowSums((temp == f[col(temp)]) | (temp[2:1] == f[col(temp)])) > 1
sindx <- rowSums((temp == s[col(temp)]) | (temp[2:1] == s[col(temp)])) > 1
tindx <- !(findx + sindx) & temp[, "REF"] %in% f
foindx <- !(findx + sindx + tindx)
res <- matrix(NA, nrow(mydf), 2)
res[findx, ] <- as.matrix(mydf[findx, f])
res[sindx, ] <- as.matrix(mydf[sindx, s])
temp2 <- cbind(rowSums(mydf[, f], na.rm = TRUE), rowSums(mydf[, s], na.rm = TRUE))
res[tindx, ] <- temp2[tindx, ]
res[foindx, ] <- temp2[foindx, 2:1]
# [,1] [,2]
# [1,] 10 12
# [2,] 11 10
# [3,] 40 1
# [4,] 68 0
# [5,] 69 0
# [6,] 69 0
Upvotes: 4