gugy

Reputation: 115

R - make it faster: check matrix positions for characters and put info into list (0/1)

I have a code snippet that does what it is supposed to do, but it is very slow, probably because of the nested for loops. Because I run it on huge files, it slows down my script considerably.

I am guessing R has a built-in function that easily does what I am doing with the for loops?

Does anyone have an idea how to make it faster?

What the code below does:

It checks, for each position in the matrix, whether the element is a letter of the alphabet (1) or any other character (0). This info is then saved in a list.

Basically, what I need to continue with is a true/false value for every matrix position, indicating whether it holds a letter. I then use that true/false list for "renumbering the matrix elements" (so that the non-alphabet characters are not counted).

UPDATE:

What I mean by "renumbering the matrix elements": protein sequences are always numbered, so a protein of length 560 has 560 amino acids in its sequence. If you make an alignment of sequences and their lengths are not identical (A: 560 amino acids, B: 600 amino acids), the alignment will introduce gaps where the sequences do not match. My matrix is an alignment and therefore has gaps (non-alphabet characters, usually "-"). To be able to address position 100 of sequence A in the alignment, I need to renumber the alignment so that only "non-gap positions" have a number, and then address that position. Otherwise, if I print position 100 of the alignment, it will not be position 100 of my sequence A.

FYI: This is for protein sequence alignments, and I want all the amino acids (alphabet characters) to be numbered, but not the gaps (other characters like "-" or "."). This later enables me to address specifically the positions where amino acids are, and to analyse my huge alignments more easily.

If clarifications are needed please comment!

 MSAmatrix<-matrix(c("A","-","B", "-", "C","A","D","B", "-", "C","A","-","B", "F", "C","A","D",".", "-", "C"), nrow=4, byrow=TRUE)

 letters<-list()
 lettersrenumbered<-list()
 referencesequence<-1
 # for whatever reason I am initialising the lists wrong and they need to be filled with 1 element before I can use them in the next loops...
 for(i in 1:dim(MSAmatrix)[1]) {
 letters[[i]]<-1313
 lettersrenumbered[[i]]<-1313
 }
 # get info if position is an alphabet character or not
 for(i in 1:dim(MSAmatrix)[1]) {
     for(j in 1:dim(MSAmatrix)[2]) {
         if(grepl("[a-zA-Z]",MSAmatrix[i,])[j]){
            letters[[i]][j]<-1  
         }
         else{  
            letters[[i]][j]<-0
        }
     }
 }

 #renumber all the sequences so that only the alphabet characters get a number
 for(i in 1:dim(MSAmatrix)[1]) {
     count<-0
     for(j in 1:dim(MSAmatrix)[2]) {
         if(letters[[i]][j]==1){
            count<-count+1
            lettersrenumbered[[i]][j]<-count    
         }
         else{
            lettersrenumbered[[i]][j]<-" "  
         }
     }
 }

Upvotes: 0

Views: 53

Answers (2)

Hugh

Reputation: 16089

On my machine the following is around 20 times faster than your method:

Create a logical matrix of the same dimensions, with every element FALSE:

X <- matrix(rep(FALSE, 20), nrow = 4, byrow = TRUE)

Where the MSAmatrix is a capital letter, mark it as TRUE

X[MSAmatrix %in% LETTERS] <- TRUE

You can eke out a bit more speed (30%) by just creating the matrix directly, though it may be a little harder to assure yourself that it's correct. That is, by just:

matrix(MSAmatrix %in% LETTERS, nrow = 4, byrow = FALSE)
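If you want to reassure yourself, here is a small check as a sketch. `%in%` returns a plain vector in column-major order, so the dimensions can be copied from `MSAmatrix` rather than hard-coding `nrow = 4`. The names `X_twostep` and `X_direct` are made up for this check.

```r
MSAmatrix <- matrix(c("A", "-", "B", "-", "C",
                      "A", "D", "B", "-", "C",
                      "A", "-", "B", "F", "C",
                      "A", "D", ".", "-", "C"), nrow = 4, byrow = TRUE)

# Two-step version: all-FALSE matrix, then mark the capital letters
X_twostep <- matrix(FALSE, nrow = nrow(MSAmatrix), ncol = ncol(MSAmatrix))
X_twostep[MSAmatrix %in% LETTERS] <- TRUE

# Direct version: %in% drops the dim attribute, so put it back
X_direct <- MSAmatrix %in% LETTERS
dim(X_direct) <- dim(MSAmatrix)

# TRUE if both approaches give exactly the same logical matrix
identical(X_twostep, X_direct)
```

Note that `%in% LETTERS` only matches upper-case letters. If the alignment can contain lower-case residues, `c(LETTERS, letters)` should work too, provided `letters` has not been overwritten with a list as in the question's code.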

It's currently unclear what you mean by "renumbering the matrix elements", but if you use apply and cumsum

apply(X, 2, cumsum)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    0    1    0    1
[2,]    2    1    2    0    2
[3,]    3    1    3    1    3
[4,]    4    2    3    1    4

I think you get close to what you intend.
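Going by the update in the question, the numbering presumably has to run along each sequence (row) rather than down each column. A sketch under that assumption follows. `pos` is a name introduced here, and gaps are marked with `NA` instead of a blank so that the matrix stays numeric.

```r
MSAmatrix <- matrix(c("A", "-", "B", "-", "C",
                      "A", "D", "B", "-", "C",
                      "A", "-", "B", "F", "C",
                      "A", "D", ".", "-", "C"), nrow = 4, byrow = TRUE)
X <- matrix(MSAmatrix %in% LETTERS, nrow = nrow(MSAmatrix))

# apply() over rows returns each row's result as a column, hence t()
pos <- t(apply(X, 1, cumsum))

# Gap positions get no number
pos[!X] <- NA

# Alignment column that holds residue 3 of sequence 1
col_of_residue <- which(pos[1, ] == 3)
col_of_residue
```

With `pos` in hand, `MSAmatrix[1, col_of_residue]` gives the residue itself, which is the kind of lookup the question describes for "position 100 of sequence A".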

Upvotes: 1

Miff

Reputation: 7941

Generally speaking, R is quickest when you perform operations on whole vectors rather than on individual elements, so you can pull the grepl call out of the loops and write:

MSAmatrix<-matrix(c("A","-","B", "-", "C","A","D","B", "-", "C","A","-","B", "F", "C","A","D",".", "-", "C"), nrow=4, byrow=TRUE)
isChar <- matrix(grepl("[a-zA-Z]",MSAmatrix), nrow=nrow(MSAmatrix))

to get a logical matrix showing which elements are letters. The next step works row-wise to create a list, so lapply is a useful place to start. This can be done with:

formatRow <- function(i){
  # running count of letters seen so far in row i
  retval <- cumsum(isChar[i,])
  # blank out the non-letter positions of this row only
  retval[!isChar[i,]] <- ""
  retval
}

lapply(1:nrow(MSAmatrix), formatRow)

For each row, the function uses cumsum to count the number of TRUE values so far in the row, and then overwrites the positions that do not correspond to letters with "", which converts the whole vector to character.

Depending on what else you're doing with the output, it may be more efficient to use apply rather than lapply and keep the output as a matrix rather than a list.
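A minimal sketch of that apply variant, assuming a character matrix with "" at the gaps is acceptable as output. `renumbered` is a name chosen here for illustration.

```r
MSAmatrix <- matrix(c("A", "-", "B", "-", "C",
                      "A", "D", "B", "-", "C",
                      "A", "-", "B", "F", "C",
                      "A", "D", ".", "-", "C"), nrow = 4, byrow = TRUE)
isChar <- matrix(grepl("[a-zA-Z]", MSAmatrix), nrow = nrow(MSAmatrix))

# Process each row of isChar; apply() hands the rows back as columns,
# so transpose to restore the alignment's orientation
renumbered <- t(apply(isChar, 1, function(row) {
  out <- cumsum(row)   # running count of letters in this row
  out[!row] <- ""      # blank the gaps (coerces the vector to character)
  out
}))

renumbered
```

The result has the same dimensions as `MSAmatrix`, so `renumbered[i, j]` lines up directly with `MSAmatrix[i, j]`.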

Upvotes: 0
