Crops
Crops

Reputation: 5154

Vectorisation of a similar string finding loop

I have a large vector of strings like this:

d <- c("herb", "market", "merchandise", "fun", "casket93", "old", "herbb", "basket", "bottle", "plastic", "baskket", "markket", "pasword", "plastik", "oldg", "mahagony", "mahaagoni", "sim23", "asket", "trump" )

I wan't to fetch similar strings for each string from the same vector d.

I am doing this by
1. calculating for each string the edit distance with all other strings strings based on certain rules such as forcing exact matching if any digits are present or if number of alphabet characters are less than 5.
2. putting it in a dataframe dist along with string.
3. subsetting dist based on distances < 3.
4. collapsing and adding the similar strings to original dataframe as a new column.

I am using the stringr and stringdist packages

d <-as.data.frame(d)
M <- nrow(d)
Dist <- data.frame(matrix(nrow=M, ncol=2)) 
colnames(Dist) <- c("string" ,"dist")
Dist$string <- d$d
d$sim <- character(length=M)

require(stringr)
require(stringdist)

for (i in 1:M){
  # if string has digits or is of short size (<5) do exact matching
  if (grepl("[[:digit:]]", d[i, "d"], ignore.case=TRUE) == TRUE || str_count(d[i, "d"], "[[:alpha:]]") < 5){
    Dist$dist <- stringdist(d[i, "d"], d$d, method="lv", maxDist=0.000001) # maxDist as fraction to force exact matching
  # otherwise do approximate matching
  } else  {
    Dist$dist <- stringdist(d[i, "d"], d$d, method="lv", maxDist=3)
  }
  # subset similar strings (with edit distance <3)
  subDist <- subset(Dist, dist < 3 )
  # add to original data.frame d
  d[i, "sim"] <- paste(as.character(unlist(subDist$string)), collapse=", ")
}

Is it possible to vectorise the procedure instead of using a loop? I have a very large vector of strings, so a calculating a distance matrix using stringdistmatrix on the entire vector can't be done due to memory restrictions. The loop works fine for large data, but is very slow.

Upvotes: 1

Views: 780

Answers (2)

Miff
Miff

Reputation: 7941

stringdist has a version for computing all the distances in a matrix, so I think that something like this will be an improvement, it's about four times as quick on my computer when run with the 100 reps line included:

d <- c("herb", "market", "merchandise", "fun", "casket93", "old", "herbb", "basket", "bottle", "plastic", "baskket", "markket", "pasword", "plastik", "oldg", "mahagony", "mahaagoni", "sim23", "asket", "trump" )
#d <- rep(d, each=100) #make it a bit longer for timing

d <-as.data.frame(d)
M <- nrow(d)
Dist <- data.frame(matrix(nrow=M, ncol=2))
colnames(Dist) <- c("string" ,"dist")
Dist$string <- d$d
d$sim <- character(length=M)

require(stringr)
require(stringdist)

ind_short <- grepl("[[:digit:]]", d[i, "d"], ignore.case=TRUE) == TRUE | str_count(d$d, "[[:alpha:]]") < 5

short <- stringdistmatrix(d$d[ind_short], d$d, method="lv", maxDist=0.000001)
long <- stringdistmatrix(d$d[!ind_short], d$d, method="lv", maxDist=3)

d$sim[ind_short] <- apply(short,1,function(x)paste(as.character(unlist(d$d[x<3])), collapse=", "))
d$sim[!ind_short] <- apply(long,1,function(x)paste(as.character(unlist(d$d[x<3])), collapse=", "))

The basic strategy is to split into short and long components, and use the matrix form of stringdist, then collapse these using paste, and assign to the right places in your d$sim


Edited to add: in the light of your comment about not being able to work on the whole matrix at once, try choosing chunk_length so that stringdistmatrix() works on a chunk_length*M matrix. Of course, if you set it to 1, you're back to your original unvectorised form

chunk_length <- 100
ind_short <- grepl("[[:digit:]]", d[i, "d"], ignore.case=TRUE) == TRUE | str_count(d$d, "[[:alpha:]]") < 5
d$iter <- rep(1:M,each=chunk_length,length.out=M)

for (i in unique(d$iter))
{
  in_iter <- (d$iter == i)
  short <- stringdistmatrix(d$d[in_iter & ind_short], d$d, method="lv", maxDist=0.000001)
  long <- stringdistmatrix(d$d[in_iter & !ind_short], d$d, method="lv", maxDist=3)

  if(sum(in_iter & ind_short)==1) short <- t(short)
  if(sum(in_iter & !ind_short)==1) long <- t(long)

  if(sum(in_iter & ind_short)>0) d$sim[in_iter & ind_short] <- apply(short,1,function(x)paste(as.character(unlist(d$d[x<3])), collapse=", "))
  if(sum(in_iter & !ind_short)>0) d$sim[in_iter & !ind_short] <- apply(long,1,function(x)paste(as.character(unlist(d$d[x<3])), collapse=", "))
}

Upvotes: 1

Rich Scriven
Rich Scriven

Reputation: 99331

It's not really an answer, but I thought it might be good to mention that agrep may be useful for you in this project. It does partial pattern matching.

> d <- c("herb", "market", "merchandise", "fun", "casket93", 
         "old", "herbb", "basket", "bottle", "plastic", "baskket",
         "markket", "pasword", "plastik", "oldg", "mahagony", 
         "mahaagoni", "sim23", "asket", "trump" )
> agr <- sapply(d, function(x) agrep(x, d, value = TRUE))
> head(agr)
$herb
[1] "herb"  "herbb"

$market
[1] "market"  "markket"

$merchandise
[1] "merchandise"

$fun
[1] "fun"

$casket93
[1] "casket93"

$old
[1] "old"     "pasword" "oldg"   

Upvotes: 0

Related Questions