Reputation: 21
I have a dataframe with DNA barcodes as rownames, and I would like to compute the pairwise differences (e.g. the Levenshtein distance) between these barcodes. The values in the dataframe need to be processed later in the analysis. I've worked out an example that uses a slightly simplified comparison: it just counts mismatches between the individual bases (A, T, G, C) after a strsplit
and puts the results in a matrix:
results <- matrix(data=NA, nrow=nrow(dat), ncol=nrow(dat))
# Split each pair of barcodes into single bases and count the mismatches.
system.time(
  for (i in 1:nrow(dat)) {
    for (j in 1:nrow(dat)) {
      results[i, j] <- sum(unlist(strsplit(rownames(dat)[i], split="")) !=
                           unlist(strsplit(rownames(dat)[j], split="")))
    }
  }
)
This all works as expected but of course is embarrassingly parallel. To save some time and to put our university cluster to good use, I would like to try and parallelize this function, but I'm having trouble getting it right. Hints would be appreciated!
Upvotes: 0
Views: 81
Reputation: 179468
Parallelisation should be the last step in optimising your code; simpler optimisations, such as vectorisation and using built-in functions, come first.
In your case, you should use adist()
to compute the Levenshtein distance.
# Function to simulate a random barcode of a given length
g <- function(n) paste(sample(c("G", "A", "C", "T"), size=n, replace=TRUE), collapse="")
# Simulate five barcodes of length 4
barcodes <- replicate(5, g(n=4))
Then use adist() (since the barcodes are random, your exact values will differ):
barcodes
[1] "CTAA" "AGGC" "CACT" "GGCG" "TTGA"
adist(barcodes, barcodes)
[,1] [,2] [,3] [,4] [,5]
[1,] 0 4 3 4 2
[2,] 4 0 4 2 3
[3,] 3 4 0 3 4
[4,] 4 2 3 0 4
[5,] 2 3 4 4 0
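Note that adist(barcodes) with a single argument returns the same symmetric matrix, so a single call is enough.
If you ever have so many barcodes that even adist() is slow, you can still parallelise afterwards by computing the distance matrix in row chunks. A minimal sketch, assuming a Unix-alike machine (parallel::mclapply does not fork on Windows; use parLapply() with a cluster there):
library(parallel)

# Split the row indices into one contiguous chunk per core, compute each
# block of rows against all barcodes with adist(), then bind the blocks
# back together into the full distance matrix.
n_cores  <- detectCores()
chunks   <- splitIndices(length(barcodes), n_cores)
blocks   <- mclapply(chunks,
                     function(idx) adist(barcodes[idx], barcodes),
                     mc.cores = n_cores)
dist_mat <- do.call(rbind, blocks)
Because splitIndices() returns the chunks in order, rbind() reassembles the rows in their original order, and dist_mat matches the serial adist(barcodes) result.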
Upvotes: 1