user2380782

Calculate a value from a matrix of character strings

I have a matrix with 7358444 rows and 110 columns. It holds character strings and looks like this:

      FORMAT     eQTL188                      eQTL193                   eQTL178                   eQTL179                       eQTL238                      
[1,] "GT:DS:GP" "0/1:0.79:0.221,0.767,0.011" "0/0:0.031:0.97,0.03,0"   "0/0:0.033:0.967,0.033,0" "0/0:0.079:0.922,0.077,0.001" "0/0:0.344:0.664,0.329,0.007"
[2,] "GT:DS:GP" "0/0:0.047:0.953,0.047,0"    "0/0:0.007:0.993,0.007,0" "0/0:0.006:0.994,0.006,0" "0/0:0.008:0.992,0.008,0"     "0/1:0.525:0.477,0.52,0.002" 
[3,] "GT:DS:GP" "0/0:0.047:0.953,0.047,0"    "0/0:0.007:0.993,0.007,0" "0/0:0.006:0.994,0.006,0" "0/0:0.008:0.992,0.008,0"     "0/1:0.527:0.476,0.521,0.003"
[4,] "GT:DS:GP" "0/0:0.048:0.952,0.048,0"    "0/0:0.007:0.993,0.007,0" "0/0:0.006:0.994,0.006,0" "0/0:0.008:0.992,0.008,0"     "0/1:0.518:0.485,0.512,0.003"

For each of my samples (the columns whose names match the pattern eQTL), I need to calculate the dosage of allele 1. It can be computed from the GP values, which come after the second : in each cell. The formula I need to apply is P(A1) = 2*P(A1/A1) + P(A1/A2), where P(A1/A1) is the first GP value and P(A1/A2) is the second (the / names a genotype, it is not a division).
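
As a worked example for a single cell, here is a minimal sketch of the calculation. It assumes the GP field is always the third colon-separated part of the string and that its first two values are P(A1/A1) and P(A1/A2); the variable names are only illustrative.

```r
# One cell, taken from row 1 of the eQTL188 column
cell <- "0/1:0.79:0.221,0.767,0.011"

# Split into GT, DS and GP, then split GP into its three probabilities
fields <- strsplit(cell, ":", fixed = TRUE)[[1]]
gp <- as.numeric(strsplit(fields[3], ",", fixed = TRUE)[[1]])

# P(A1) = 2 * P(A1/A1) + P(A1/A2), i.e. 2 * 0.221 + 0.767
dosage <- 2 * gp[1] + gp[2]
```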

The result (numeric matrix) that I am looking for would look like this

     eQTL188  eQTL193 eQTL178 eQTL179 eQTL238
[1,] 1.209    1.970   1.967   1.921   1.657
[2,] 1.953    1.993   1.994   1.992   1.474
[3,] 1.953    1.993   1.994   1.992   1.473
[4,] 1.952    1.993   1.994   1.992   1.482

Since the matrix is huge, speed could be an issue.

Answers (1)

erocoar

One approach is to first extract the GP field (everything after the second :) and then split it on the commas with strsplit. The formula can then be applied with lapply, for example:

df <- matrix(c("GT:DS:GP" ,"0/1:0.79:0.221,0.767,0.011" ,"0/0:0.031:0.97,0.03,0" ,  "0/0:0.033:0.967,0.033,0", "0/0:0.079:0.922,0.077,0.001", "0/0:0.344:0.664,0.329,0.007",
            "GT:DS:GP" ,"0/0:0.047:0.953,0.047,0" ,   "0/0:0.007:0.993,0.007,0", "0/0:0.006:0.994,0.006,0" ,"0/0:0.008:0.992,0.008,0"  ,   "0/1:0.525:0.477,0.52,0.002",
            "GT:DS:GP", "0/0:0.047:0.953,0.047,0"  ,  "0/0:0.007:0.993,0.007,0", "0/0:0.006:0.994,0.006,0" ,"0/0:0.008:0.992,0.008,0"  ,   "0/1:0.527:0.476,0.521,0.003",
            "GT:DS:GP" ,"0/0:0.048:0.952,0.048,0" ,   "0/0:0.007:0.993,0.007,0", "0/0:0.006:0.994,0.006,0" ,"0/0:0.008:0.992,0.008,0"   ,  "0/1:0.518:0.485,0.512,0.003"), 
            ncol=6, byrow=TRUE)


# Drop the FORMAT column so only the sample columns remain
df <- df[, -1]
# Keep only the GP field (everything after the second colon)
df <- gsub(".+:.+:(.*)", "\\1", df)
# Split each GP field on commas and apply P(A1) = 2*P(A1/A1) + P(A1/A2)
out <- lapply(strsplit(df, ","), function(x) {
  x <- as.numeric(x)
  return(2 * x[1] + x[2])
})

# Reassemble the list into a numeric matrix with the original shape
out <- do.call(rbind, out)
dim(out) <- dim(df)

      [,1]  [,2]  [,3]  [,4]  [,5]
[1,] 1.209 1.970 1.967 1.921 1.657
[2,] 1.953 1.993 1.994 1.992 1.474
[3,] 1.953 1.993 1.994 1.992 1.473
[4,] 1.952 1.993 1.994 1.992 1.482

Here the / in P(A1/A1) and P(A1/A2) is read as a genotype label rather than a division, so the formula becomes 2 * x[1] + x[2].
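
Since speed was raised as a concern, a vectorized variant may be worth trying. The sketch below is not part of the answer above: gp_dosage is a hypothetical helper, and it assumes that every cell has the form GT:DS:GP, that GP holds at least three comma-separated values, and that the intended formula is 2 * P(A1/A1) + P(A1/A2). It replaces the per-cell lapply with a few regex passes over the whole matrix.

```r
# Hypothetical helper (an assumption, not from the original answer):
# vectorized allele-1 dosage for a character matrix of "GT:DS:GP" cells.
gp_dosage <- function(m) {
  # Strip everything up to the last colon, leaving the GP field "p11,p12,p22"
  gp <- sub("^.*:", "", m)
  # First comma-separated value: P(A1/A1)
  p11 <- as.numeric(sub(",.*$", "", gp))
  # Second comma-separated value: P(A1/A2)
  p12 <- as.numeric(sub("^[^,]*,([^,]*),.*$", "\\1", gp))
  out <- 2 * p11 + p12
  # as.numeric() drops the matrix attributes, so restore shape and names
  dim(out) <- dim(m)
  dimnames(out) <- dimnames(m)
  out
}

# Small usage example built from two sample columns of the question
m <- matrix(c("0/1:0.79:0.221,0.767,0.011", "0/0:0.047:0.953,0.047,0",
              "0/0:0.031:0.97,0.03,0",      "0/0:0.007:0.993,0.007,0"),
            nrow = 2, dimnames = list(NULL, c("eQTL188", "eQTL193")))
res <- gp_dosage(m)
```

For the full matrix it might help to call such a function on one block of rows or columns at a time, so that the intermediate character vectors stay small. Whether this is actually faster on the real data would need to be measured.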
