Reputation: 662
I need to calculate the Hamming distance between an input string and a large dataset of strings. (All strings in the dataset have the same length as the input string.)
For example, if
input <- "YNYYEY"
dataset <- c("YNYYEE", "YNYYYY", "YNENEN", "YNYYEY")
the Hamming distances between input and each string in dataset are 1, 1, 3, 0, so the minimum is 0. I have written a function to calculate the Hamming distance between two strings:
HD <- function(str1, str2) {
  str1 <- as.character(str1)
  str2 <- as.character(str2)
  length.str1 <- nchar(str1)
  length.str2 <- nchar(str2)
  # Split each string into a vector of single characters
  string.temp1 <- character(length.str1)
  for (i in 1:length.str1) {
    string.temp1[i] <- substr(str1, start = i, stop = i)
  }
  string.temp2 <- character(length.str2)
  for (i in 1:length.str2) {
    string.temp2[i] <- substr(str2, start = i, stop = i)
  }
  # Count the positions where the two strings differ
  return(sum(string.temp1 != string.temp2))
}
But the dataset is too big, so I need to speed this up. Do you have any ideas for doing it quickly? Thank you for your help.
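For reference, a direct way to apply this per-pair logic over the whole dataset (a minimal sketch of the slow baseline, using an inline one-liner equivalent to HD) is:

```r
input <- "YNYYEY"
dataset <- c("YNYYEE", "YNYYYY", "YNENEN", "YNYYEY")

# Loop over the dataset in R, comparing character by character;
# this is the O(n * string length) baseline that needs speeding up
d <- sapply(dataset, function(s)
  sum(strsplit(input, "")[[1]] != strsplit(s, "")[[1]]))
unname(d)
#[1] 1 1 3 0
min(d)
#[1] 0
```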
Upvotes: 2
Views: 2448
Reputation: 73265
At the R level you can use strsplit, cbind, !=, colSums and min. They are all "vectorized".
a <- "YNYYEY"
b <- c("YNYYEE", "YNYYYY", "YNENEN", "YNYYEY")
A <- strsplit(a, split = "")[[1]]
#[1] "Y" "N" "Y" "Y" "E" "Y"
B <- do.call("cbind", strsplit(b, split = ""))
# [,1] [,2] [,3] [,4]
#[1,] "Y" "Y" "Y" "Y"
#[2,] "N" "N" "N" "N"
#[3,] "Y" "Y" "E" "Y"
#[4,] "Y" "Y" "N" "Y"
#[5,] "E" "Y" "E" "E"
#[6,] "E" "Y" "N" "Y"
D <- colSums(A != B)
#[1] 1 1 3 0
min(D)
#[1] 0
This kind of "vectorization" creates many temporary matrices / vectors and uses plenty of RAM. But hopefully it is worthwhile.
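If RAM is the bottleneck, a possible middle ground (a sketch, not benchmarked here) is to skip the cbind step and reduce each split string directly with vapply, so the full character matrix is never materialized:

```r
a <- "YNYYEY"
b <- c("YNYYEE", "YNYYYY", "YNENEN", "YNYYEY")

A <- strsplit(a, split = "")[[1]]
# vapply loops in R (slower than colSums), but only one small
# character vector is alive at a time instead of a 6 x n matrix
d <- vapply(strsplit(b, split = ""), function(x) sum(x != A), integer(1L))
d
#[1] 1 1 3 0
min(d)
#[1] 0
```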
At the C / C++ level you can do much, much better (see a case study here), but I am not keen on writing C / C++ code today.
I came across the stringdist package (there is even a stringdist tag). The function stringdist relies on a workhorse routine stringdist:::do_dist, which is written in C. It spares me the effort.
library(stringdist)
d <- stringdist(a, b, method = "hamming")
#[1] 1 1 3 0
min(d)
#[1] 0
stringdist() runs almost ten times slower than colSums().

That is really interesting; its C code or R code is probably doing something more complicated.
Upvotes: 3
Reputation: 11639
You cannot do better than O(n): you have to scan the whole dataset and compute the distance for each observation.

The only improvement can happen on your dataset: if you sort the observations based on a given point, it may be easier to find an exact match (a 0-distance result) in the dataset. This is the only improvement you can make.
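A cheap base-R version of this idea (a sketch; match() does an exact hash lookup as a fast path, falling back to the full scan only when there is no 0-distance hit):

```r
nearest_hd <- function(input, dataset) {
  # Fast path: an exact match means the minimum distance is 0
  if (!is.na(match(input, dataset))) return(0L)
  # Otherwise fall back to scanning the whole dataset
  A <- strsplit(input, split = "")[[1]]
  min(vapply(strsplit(dataset, split = ""),
             function(x) sum(x != A), integer(1L)))
}

nearest_hd("YNYYEY", c("YNYYEE", "YNYYYY", "YNENEN", "YNYYEY"))
#[1] 0
nearest_hd("YNYYEY", c("YNYYEE", "YNENEN"))  # no exact match
#[1] 1
```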
Upvotes: 2