Nicholas

Reputation: 93

Is there a way to speed up the R package stringdist significantly, e.g. by using Rcpp?

I have a loop in which I have to calculate the distance between one string and a vector of many strings. I use the package "stringdist" and its function of the same name, which works well.

However, calculating the distances takes quite some time on each pass. For example, getting the distances between one word and 3.5 million other words takes roughly 0.5 seconds. That does not sound like much, but doing it 3.5 million times takes far too long.
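For reference, the call inside the loop looks roughly like this (the object names are just placeholders for my actual data, and I use the Levenshtein method as the example here):

library(stringdist)

# one word compared against the full vector of ~3.5 million strings
system.time(
  d <- stringdist(word, all_words, method = "lv")
)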

I cannot build a full distance matrix, as it would be 3.5 million by 3.5 million and far too large; that is why I split the calculations up instead of computing the complete matrix.

Is there possibly a way to compute Levenshtein and/or Hamming distances using Rcpp to speed this up (a lot)?

I have tried the compiler package and cmpfun, but this does not change the speed. I assume I would have to rewrite the stringdist function in C++, but unfortunately I have no idea how.
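What I tried was roughly along these lines (the wrapper function is just for illustration), and it made no noticeable difference:

library(compiler)
library(stringdist)

# byte-compile a small wrapper around the stringdist call
dist_cmp <- cmpfun(function(word, pool) stringdist(word, pool, method = "lv"))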

The stringdist call is the part of the loop that takes >95% of the time of each step, so a reduction there would help immensely.

Any help would be appreciated.

Edit 1:

This is a small vector of strings as example:

bioID
  [1] "F"                 "FPhGLGcbLFF"       "FhGWhSLYcmYF"      "FhGcGbGYhF"        "GGLcSFhGmhLYF"     "GGbhhcLFF"        
  [7] "GLWGcGLmhcLFF"     "GLYmcmFF"          "GLbcmYF"           "GLhhFGmGccmFF"     "GLhhLGYLbGmFF"     "GLhhLGbGchmYF"    
 [13] "GLhhLGmLYcmYF"     "GLhhLLLGmcmFF"     "GLhhLhGGGcmYF"     "GLhhPPmmchmYF"     "GLhhmGbGLcmYF"     "GLhmYbGmmPmbF"    
 [19] "GLhmcbLFF"         "GPhbhYmhPLbF"      "GbhmLFF"           "GhhYcmYF"          "GmGhGYhcLFF"       "GmbmbmhcLFF"      
 [25] "LGGYmcmFF"         "LGLGmPmbF"         "LGbF"              "LGhbLchmYF"        "LLGLYhGcLFF"       "LLPGhhbPLmcmFF"   
 [31] "LLcmmPPmhcLFF"     "LLhhLLGLhhYmcmFF"  "LPPhcbLFF"         "LYcmYF"            "LbGmmPmbF"         "LbLLGGccmYF"      
 [37] "LhPbGchmYF"        "LhbGbmYGYhF"       "LmhGLmLLhF"        "PGYLhGcGYhF"       "PLhhLLGLhhYmcmFF"  "PLhhchhGhGLccmFF" 
 [43] "PLhmGLhhPmGGcmFF"  "PbLhhbLmhGcLFF"    "PbbcbGbcLGYhF"     "PbhLcLGmhLYF"      "PcLFF"             "PcPcLFF"          
 [49] "PhbcLSmcmFF"       "PmYcmYF"           "PmbF"              "SFFbmbhLYcmYF"     "SGGGbhchmYF"       "SGGPhLGcbLFF"     
 [55] "SGGmGcmhGcLFF"     "SGLGcFGhcLFF"      "SGLGmGLGcmYF"      "SGLLGhGmhLYF"      "SGLPbPmYmcmFF"     "SGLWhGcGbLFF"     
 [61] "SGLmmLmhcLFF"      "SGPLbbmmPmbF"      "SGPmhLcbcchmYF"    "SGSGGbLhchmYF"     "SGWGYLGcmYF"       "SGWhLbPLbcmYF"    
 [67] "SGbGGmhLYF"        "SGcbLFGmmPmbF"     "SGcmWGGGLLbF"      "SGhLLGGLmcmYF"     "SGhbhGPcYcmYF"     "SGmGGLLFLYmcmFF"  
 [73] "SGmLGLLmPmbF"      "SLFGGhGbbLcLFF"    "SLFGbGFhcLFF"      "SLFGmGGhGLmLLhF"   "SLFPFbcLLLYcmYF"   "SLFPLLGGhchmYF"   
 [79] "SLFSFbcLFF"        "SLFbGhcmYGYhF"     "SLFbGmLYGYhF"      "SLFcGGLccbLFF"     "SLFhGLLmhcLFF"     "SLFmGLbcmGmcmFF"  
 [85] "SLFmPchmYcmYF"     "SLFmbPLGLmLLhF"    "SLGGGLLFYmcmFF"    "SLGGGLLGmhcLFF"    "SLGGGLmcbLFF"      "SLGGGYmcmFF"      
 [91] "SLGGGhGLmLLhF"     "SLGGGhLcYmcmFF"    "SLGGGhhcLFF"       "SLGGLGYhmcmFF"     "SLGGLLGcYmcmFF"    "SLGGLLhGcLFF"     
 [97] "SLGGLhFhcLFF"      "SLGGSGLhGhhYmcmFF" "SLGGbLYcmYF"       "SLGGbbcLYGYhF"    

Edit 2: I used @Andrew's comment to rethink the problem, and here is how I am handling it now:

As I currently only need distances of 1 anyway, I only calculate the distance between the string at hand and strings that are either the same length, one letter shorter, or one letter longer. That reduces the time significantly. It still takes a very long time for large datasets, but it already helps.
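A rough sketch of that pre-filter (names are illustrative; bioID is the vector from Edit 1):

# keep only candidates whose length differs by at most one character
len  <- nchar(word)
cand <- bioID[abs(nchar(bioID) - len) <= 1]

# compute distances only against the reduced candidate set
d       <- stringdist(word, cand, method = "lv")
matches <- cand[d == 1]   # I only need the distance-1 matches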

Upvotes: 0

Views: 788

Answers (1)

SpaceCadet3333

Reputation: 1

As mentioned before, foreach loops are a good place to start. Depending on your machine, you can also tune the number of cores you use by running microbenchmark and system.time() on a small sample of the data before launching the whole job. I just generated a stringdist matrix for ~2.6 million records compared against about 36,000 records using 10 cores; it took about 48 hours. Still not terribly fast, but feasible given how computationally expensive this process is. The stringdist matrix also appears to bottleneck heavily at the foreach loop, so you could probably reduce overhead by breaking your data into chunks (a rough sketch of a chunked version follows the code below).

library(stringdist)
library(doParallel)
library(foreach)
library(parallel)
library(microbenchmark)

# experiment with the number of cores for your machine
cores <- 10
cluster <- parallel::makeCluster(cores)  # make a cluster to utilize the cores
registerDoParallel(cluster)              # register the cores with foreach

# foreach loop computing the minimum Levenshtein ("lv") distance for each row;
# .combine = "c" collects the per-row minima into a single vector
system.time(
  min_distances <- foreach(i = 1:nrow(sales), .combine = "c") %dopar% {
    column <- sales$column[i]
    dist_vector <- stringdist::stringdistmatrix(column, companies$name, method = "lv")
    min(dist_vector)  # value returned for row i
  }
)

parallel::stopCluster(cluster)  # release the workers when finished
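A rough sketch of how the foreach body could be chunked, using the same libraries and cluster setup as above (the chunk size of 10,000 and the column names are just placeholders):

# split the row indices into blocks so each worker processes a chunk of strings
# instead of a single row, which cuts scheduling overhead
chunks <- split(seq_len(nrow(sales)), ceiling(seq_len(nrow(sales)) / 10000))

min_distances <- foreach(idx = chunks, .combine = "c") %dopar% {
  d <- stringdist::stringdistmatrix(sales$column[idx], companies$name, method = "lv")
  apply(d, 1, min)   # minimum distance for every string in the chunk
}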

Upvotes: 0
