Codezy

Reputation: 662

Find the minimum hamming distance between a string and a long vector of strings (fast)

I need to calculate the Hamming distance between an input string and a large dataset of strings. (All strings in the dataset have the same length as the input string.)

For example, if

input <- "YNYYEY"
dataset <- c("YNYYEE", "YNYYYY", "YNENEN", "YNYYEY")

the Hamming distance between input and each string in dataset is 1, 1, 3, 0, so the minimum is 0. I have written a function that calculates the Hamming distance between two strings:

HD <- function(str1, str2) {
  str1 <- as.character(str1)
  str2 <- as.character(str2)

  length.str1 <- nchar(str1)
  length.str2 <- nchar(str2)

  # split each string into a vector of single characters
  string.temp1 <- c()
  for (i in 1:length.str1) {
    string.temp1[i] <- substr(str1, start = i, stop = i)
  }
  string.temp2 <- c()
  for (i in 1:length.str2) {
    string.temp2[i] <- substr(str2, start = i, stop = i)
  }

  # count positions where the characters differ
  sum(string.temp1 != string.temp2)
}

But the dataset is too big, so I need to speed this up. Do you have any idea how I can do it quickly? Thank you for your help.

Upvotes: 2

Views: 2448

Answers (2)

Zheyuan Li

Reputation: 73265

At R level you can use strsplit, cbind, !=, colSums and min. They are all "vectorized".

a <- "YNYYEY"
b <- c("YNYYEE", "YNYYYY", "YNENEN", "YNYYEY")
A <- strsplit(a, split = "")[[1]]
#[1] "Y" "N" "Y" "Y" "E" "Y"
B <- do.call("cbind", strsplit(b, split = ""))
#     [,1] [,2] [,3] [,4]
#[1,] "Y"  "Y"  "Y"  "Y" 
#[2,] "N"  "N"  "N"  "N" 
#[3,] "Y"  "Y"  "E"  "Y" 
#[4,] "Y"  "Y"  "N"  "Y" 
#[5,] "E"  "Y"  "E"  "E" 
#[6,] "E"  "Y"  "N"  "Y" 
D <- colSums(A != B)
#[1] 1 1 3 0
min(D)
#[1] 0

This kind of "vectorization" creates many temporary matrices / vectors and uses plenty of RAM, but hopefully it is worthwhile.
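If RAM is a concern, the same colSums() idea can be applied in chunks so that only a slice of the character matrix exists at a time. A minimal sketch (the function name min_hamming and the chunk size are my own choices, not part of the answer):

```r
# Chunked version of the colSums() approach: build the character matrix
# for only 'chunk_size' strings at a time, keeping a running minimum.
min_hamming <- function(input, dataset, chunk_size = 10000L) {
  A <- strsplit(input, split = "")[[1]]
  best <- nchar(input)  # worst possible Hamming distance
  for (start in seq(1L, length(dataset), by = chunk_size)) {
    end <- min(start + chunk_size - 1L, length(dataset))
    B <- do.call("cbind", strsplit(dataset[start:end], split = ""))
    best <- min(best, colSums(A != B))
    if (best == 0) break  # cannot do better than an exact match
  }
  best
}

min_hamming("YNYYEY", c("YNYYEE", "YNYYYY", "YNENEN", "YNYYEY"))
#[1] 0
```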

At C / C++ level you can do much, much better (see a case study here), but I am not keen on writing C / C++ code today.


I came across the stringdist package (there is even a stringdist tag). The function stringdist relies on a workhorse routine stringdist:::do_dist, which is written in C. That spares me the effort.

library(stringdist)
d <- stringdist(a, b, method = "hamming")
#[1] 1 1 3 0
min(d)
#[1] 0

stringdist() runs almost ten times slower than the colSums() approach.

That is really interesting. Probably its C or R code is doing something else more complicated.
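For reference, a timing claim like the one above could be checked along these lines. This is my own sketch, not the answer's benchmark: it assumes the stringdist package is installed, generates random Y/N/E strings, and uses base system.time() to avoid extra dependencies.

```r
library(stringdist)

a <- "YNYYEY"
set.seed(1)
# 1e5 random strings of length 6 over the alphabet Y/N/E
b <- apply(matrix(sample(c("Y", "N", "E"), 6e5, replace = TRUE), nrow = 6),
           2, paste0, collapse = "")
A <- strsplit(a, split = "")[[1]]

# time the pure-R vectorized approach
system.time({
  B <- do.call("cbind", strsplit(b, split = ""))
  r1 <- min(colSums(A != B))
})

# time the stringdist approach
system.time(r2 <- min(stringdist(a, b, method = "hamming")))

stopifnot(r1 == r2)  # both approaches must agree on the minimum
```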

Upvotes: 3

Sal-laS

Reputation: 11639

You cannot do better than O(n): you have to scan the entire dataset and compute the distance for each observation.

The only improvement you can make is on the dataset side: if you sort (or index) the observations, you can find an exact match in the dataset (a 0-distance result) more quickly. That is the only improvement you can make.
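If only the minimum distance matters, the zero-distance case described above can be short-circuited without sorting at all. A minimal sketch (the helper name min_hd and the %in% hash lookup are my own additions; the fallback reuses the colSums() idea from the other answer):

```r
# Return the minimum Hamming distance from 'input' to 'dataset'.
# An exact match is detected first with %in% (a hash-based lookup,
# no sorting needed); only when none exists do we pay for the full
# O(n) distance scan.
min_hd <- function(input, dataset) {
  if (input %in% dataset) return(0L)
  A <- strsplit(input, split = "")[[1]]
  B <- do.call("cbind", strsplit(dataset, split = ""))
  min(colSums(A != B))
}

min_hd("YNYYEY", c("YNYYEE", "YNYYYY", "YNENEN"))
#[1] 1
```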

Upvotes: 2
