mhbrugman

Reputation: 21

Parallelize a nested for loop in R

I have a data frame with DNA barcodes as row names, and I would like to determine the pairwise difference (e.g. Levenshtein distance) between these barcodes. The values in the data frame need to be processed later in the analysis. I've worked out an example that uses a slightly simplified analysis: it just compares the individual bases (A, T, G, C) position by position after a strsplit and puts the results in a matrix:

# Pre-allocate the result matrix: one row and one column per barcode.
results <- matrix(data = NA, nrow = nrow(dat), ncol = nrow(dat))

# Do the string splitting and comparison of the barcodes one by one.
system.time(
    for (i in seq_len(nrow(dat))) {
        for (j in seq_len(nrow(dat))) {
            results[i, j] <- sum(unlist(strsplit(rownames(dat)[i], split = "")) != unlist(strsplit(rownames(dat)[j], split = "")))
        }
    }
)

This all works as expected, but of course it is embarrassingly parallel. To save some time and to put our university cluster to good use, I would like to parallelize this loop, but I'm having trouble getting it right. Hints would be appreciated!

Upvotes: 0

Views: 81

Answers (1)

Andrie

Reputation: 179468

Parallelisation should be the last step in optimising your code, after you've tried the easier steps, which include:

  • Vectorisation
  • Using built-in high performance functions
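As an illustration of the vectorisation step, here is a minimal sketch of the position-by-position comparison from the question, with each barcode split only once instead of inside the double loop. The example barcodes and the names `bases` and `mismatches` are assumptions for illustration, and the sketch assumes all barcodes have the same length. Note that this counts mismatches per position (a Hamming distance), which can differ from the Levenshtein distance when insertions or deletions give a shorter edit path.

```r
# Hypothetical example data: five equal-length barcodes.
barcodes <- c("CTAA", "AGGC", "CACT", "GGCG", "TTGA")

# Split every barcode exactly once; each row holds the bases of one barcode.
bases <- do.call(rbind, strsplit(barcodes, split = ""))

# Add up the mismatches one position (column) at a time.
# outer() compares every barcode with every other barcode at position k.
n <- length(barcodes)
mismatches <- matrix(0L, nrow = n, ncol = n)
for (k in seq_len(ncol(bases))) {
  mismatches <- mismatches + outer(bases[, k], bases[, k], FUN = "!=")
}

mismatches
```

The remaining loop runs over barcode positions rather than over pairs of barcodes, so its length stays small even when there are many barcodes.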

In your case, you should use adist() (from the utils package that ships with R) to compute the Levenshtein distance.

# Function to simulate barcodes of given length
g <- function(n) paste(sample(c("G", "A", "C", "T"), size = n, replace = TRUE), collapse = "")

# Replicate data
barcodes <- replicate(5, g(n=4))

Then use adist():

barcodes

[1] "CTAA" "AGGC" "CACT" "GGCG" "TTGA"


adist(barcodes, barcodes)
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    4    3    4    2
[2,]    4    0    4    2    3
[3,]    3    4    0    3    4
[4,]    4    2    3    0    4
[5,]    2    3    4    4    0
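If adist() on its own is still too slow for a very large set of barcodes, the rows of the distance matrix can be computed in independent chunks. The following is only a sketch under assumptions: it uses the base `parallel` package with a local two-worker cluster, and the worker count, the example barcodes, and the names `chunks`, `blocks` and `dist_par` are illustrative rather than taken from the question. On a real cluster the setup of `cl` would likely differ.

```r
library(parallel)

# Hypothetical example data: five barcodes.
barcodes <- c("CTAA", "AGGC", "CACT", "GGCG", "TTGA")

# Assumption: two local worker processes.
n_workers <- 2
cl <- makeCluster(n_workers)

# Split the row indices into one contiguous chunk per worker.
chunks <- splitIndices(length(barcodes), n_workers)

# Each worker computes the distance rows for its chunk against all barcodes.
blocks <- parLapply(
  cl, chunks,
  function(idx, all_barcodes) utils::adist(all_barcodes[idx], all_barcodes),
  all_barcodes = barcodes
)
stopCluster(cl)

# Stack the row blocks back into the full distance matrix.
dist_par <- do.call(rbind, blocks)
```

Because each chunk only needs the full barcode vector and its own row indices, the workers do not have to communicate with each other. For small inputs the overhead of starting the cluster will probably outweigh any gain.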

Upvotes: 1
