Reputation: 10253

R - Longest common substring

Does anyone know of an R package that solves the longest common substring problem? I am looking for something fast that could work on vectors.

Upvotes: 9

Answers (5)

Julien

Reputation: 1702

The LCSn function (PTXQC package) can find the longest common string for all strings in the input vector. A warning is that the shortest string is used as the base, so it might not give you what you want when comparing multiple strings. However, it is a good option to compare two sequences.

library(PTXQC)
LCSn(c("hello","hello world"))
#[1] "hello"

LCSn(c("hello", "hel123l5678o"))
#[1] "hel"

Answer taken from here

Upvotes: 1

Shane

Reputation: 100194

Check out the "Rlibstree" package on omegahat Github

This uses http://www.icir.org/christian/libstree/.

Upvotes: 5

Michael Ohlrogge

Reputation: 10990

The question here is not totally clear on the intended application of the solution to the longest common substring problem. A common application that I encounter is matching between names in different datasets. The stringdist package has a useful function amatch() which I find suitable for this task.

In brief, amatch() will take as input two vectors, the first is x the vector of strings that you want to find matches from (this can also just be a single string), the second is table, which is the vector of strings you want to make comparisons to and choose the match with the longest common substring. amatch() will then return a vector whose length equals that of x - each element of this result will be an index in table that contains the best match.

Details: amatch() takes a method argument, which you specify to be lcs if you want matching on longest common substring. There are many other options for different string matching techniques (e.g. Levenshtein distance). There is also a mandatory maxDist argument. If all strings in table are greater "distance" from a given string in x, then amatch() will return NA for that element of its output. "distance" is defined differently depending on the string matching algorithm you choose. For lcs, it (more or less) just means how many different (non-matched) characters there are. See documentation for details.

Parallelization: another nice feature of amatch() is that it will automatically parallelize the operation for you, making reasonable guesses about system resources to use. If you want more control over this, you can toggle the nthread argument.

Example application:

library(stringdist)

Names1 = c(
"SILVER EAGLE REFINING, INC. (SW)",
"ANTELOPE REFINING",
"ANTELOPE REFINING (DOUGLAS FACILITY)"
)

Names2 = c(
"Mobile Concrete, Inc.",
"Antelope Refining, LLC. ",
"Silver Eagle Refining Inc."
)

Match_Idx = amatch(tolower(Names1), tolower(Names2), method = 'lcs', maxDist = Inf)
Match_Idx
# [1] 3 2 2

Matches = data.frame(Names1, Names2[Match_Idx])
Matches

#                                 Names1          Names2.Match_Idx.
# 1     silver eagle refining, inc. (sw) silver eagle refining inc.
# 2                    antelope refining   antelope refining, llc. 
# 3 antelope refining (douglas facility)   antelope refining, llc. 

### Compare Matches:

Matches$Distance = stringdist(Matches$Names1, Matches$Match, method = 'lcs')

Also, unlike functions like LCS from qualV, this will not consider "subsequence" matches that involve ignoring intermediate characters in order to form a match (as discussed here). For instance, see this:

Names1 = c(
"hello"
)

Names2 = c(
"hel123l5678o",
"hell"
)

Match_Idx = amatch(tolower(Names1), tolower(Names2), method = 'lcs', maxDist = Inf)

Matches = data.frame(Names1, Match = Names2[Match_Idx])
Matches

# 1  hello  hell

Upvotes: 2