Reputation: 5154
I have a large vector of strings like this:
d <- c("herb", "market", "merchandise", "fun", "casket93", "old", "herbb", "basket", "bottle", "plastic", "baskket", "markket", "pasword", "plastik", "oldg", "mahagony", "mahaagoni", "sim23", "asket", "trump" )
I want to fetch similar strings for each string from the same vector d.
I am doing this by:
1. calculating, for each string, the edit distance to all other strings, with rules such as forcing exact matching if any digits are present or if the string has fewer than 5 alphabetic characters;
2. putting the distances into a data frame Dist along with the strings;
3. subsetting Dist to distances < 3;
4. collapsing the similar strings and adding them to the original data frame as a new column.
I am using the stringr and stringdist packages.
d <- as.data.frame(d)
M <- nrow(d)
Dist <- data.frame(matrix(nrow=M, ncol=2))
colnames(Dist) <- c("string", "dist")
Dist$string <- d$d
d$sim <- character(length=M)
require(stringr)
require(stringdist)
for (i in 1:M){
  # if string has digits or is of short size (<5) do exact matching
  if (grepl("[[:digit:]]", d[i, "d"], ignore.case=TRUE) || str_count(d[i, "d"], "[[:alpha:]]") < 5){
    Dist$dist <- stringdist(d[i, "d"], d$d, method="lv", maxDist=0.000001) # maxDist as fraction to force exact matching
  # otherwise do approximate matching
  } else {
    Dist$dist <- stringdist(d[i, "d"], d$d, method="lv", maxDist=3)
  }
  # subset similar strings (with edit distance <3)
  subDist <- subset(Dist, dist < 3)
  # add to original data.frame d
  d[i, "sim"] <- paste(as.character(unlist(subDist$string)), collapse=", ")
}
Is it possible to vectorise the procedure instead of using a loop? I have a very large vector of strings, so calculating a distance matrix with stringdistmatrix on the entire vector can't be done due to memory restrictions. The loop works fine for large data, but is very slow.
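For a sense of scale, here is a rough back-of-the-envelope estimate of what the full distance matrix alone would take (the 100,000 below is just an assumed, illustrative length for the vector):
M <- 1e5             # illustrative vector length
M^2 * 8 / 1024^3     # ~74.5 GiB for an M x M matrix of doubles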
Upvotes: 1
Views: 780
Reputation: 7941
stringdist has a version for computing all the distances in a matrix, so I think something like this will be an improvement; it's about four times as quick on my computer when run with the 100-reps line included:
d <- c("herb", "market", "merchandise", "fun", "casket93", "old", "herbb", "basket", "bottle", "plastic", "baskket", "markket", "pasword", "plastik", "oldg", "mahagony", "mahaagoni", "sim23", "asket", "trump" )
#d <- rep(d, each=100) #make it a bit longer for timing
d <- as.data.frame(d)
M <- nrow(d)
Dist <- data.frame(matrix(nrow=M, ncol=2))
colnames(Dist) <- c("string", "dist")
Dist$string <- d$d
d$sim <- character(length=M)
require(stringr)
require(stringdist)
ind_short <- grepl("[[:digit:]]", d$d, ignore.case=TRUE) | str_count(d$d, "[[:alpha:]]") < 5 # digits or fewer than 5 letters: exact matching
short <- stringdistmatrix(d$d[ind_short], d$d, method="lv", maxDist=0.000001)
long <- stringdistmatrix(d$d[!ind_short], d$d, method="lv", maxDist=3)
d$sim[ind_short] <- apply(short, 1, function(x) paste(as.character(unlist(d$d[x < 3])), collapse=", "))
d$sim[!ind_short] <- apply(long, 1, function(x) paste(as.character(unlist(d$d[x < 3])), collapse=", "))
The basic strategy is to split the strings into short and long groups, use the matrix form of stringdist on each, collapse the matches with paste, and assign them to the right places in your d$sim.
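As a quick sanity check (just a sketch; sim_loop stands for a hypothetical copy of the sim column saved from your original loop run on the same data), the two versions should agree:
identical(d$sim, sim_loop) # expected TRUE when both were run on the same d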
Edited to add: in light of your comment about not being able to work on the whole matrix at once, try choosing chunk_length so that stringdistmatrix() works on a chunk_length * M matrix at a time. Of course, if you set it to 1, you're back to your original unvectorised form.
chunk_length <- 100
ind_short <- grepl("[[:digit:]]", d$d, ignore.case=TRUE) | str_count(d$d, "[[:alpha:]]") < 5 # same split as above: digits or fewer than 5 letters
d$iter <- rep(1:M, each=chunk_length, length.out=M)
for (i in unique(d$iter))
{
  in_iter <- (d$iter == i)
  short <- stringdistmatrix(d$d[in_iter & ind_short], d$d, method="lv", maxDist=0.000001)
  long <- stringdistmatrix(d$d[in_iter & !ind_short], d$d, method="lv", maxDist=3)
  # keep the result a one-row matrix for apply() when only one string falls in the chunk
  if (sum(in_iter & ind_short) == 1) short <- t(short)
  if (sum(in_iter & !ind_short) == 1) long <- t(long)
  if (sum(in_iter & ind_short) > 0) d$sim[in_iter & ind_short] <- apply(short, 1, function(x) paste(as.character(unlist(d$d[x < 3])), collapse=", "))
  if (sum(in_iter & !ind_short) > 0) d$sim[in_iter & !ind_short] <- apply(long, 1, function(x) paste(as.character(unlist(d$d[x < 3])), collapse=", "))
}
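To pick chunk_length, note that each call to stringdistmatrix() now only works on a chunk_length x M block of doubles (the M below is just an assumed, illustrative size):
chunk_length <- 100
M <- 1e5                        # illustrative vector length
chunk_length * M * 8 / 1024^2   # ~76 MiB per chunk instead of the full M x M matrix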
Upvotes: 1
Reputation: 99331
It's not really an answer, but I thought it might be good to mention that agrep may be useful for you in this project. It does approximate (fuzzy) pattern matching.
> d <- c("herb", "market", "merchandise", "fun", "casket93",
"old", "herbb", "basket", "bottle", "plastic", "baskket",
"markket", "pasword", "plastik", "oldg", "mahagony",
"mahaagoni", "sim23", "asket", "trump" )
> agr <- sapply(d, function(x) agrep(x, d, value = TRUE))
> head(agr)
$herb
[1] "herb" "herbb"
$market
[1] "market" "markket"
$merchandise
[1] "merchandise"
$fun
[1] "fun"
$casket93
[1] "casket93"
$old
[1] "old" "pasword" "oldg"
Upvotes: 0