allquixotic
allquixotic

Reputation: 1559

Longest substring that matches a string in an array

Assume I have the following input and that my implementation language is Java:

I need to (as efficiently as is practical on a typical computer) assemble a list of substrings of the string S that are contained (entirely) within an element of the array A, and get those in descending order. I also need to know the starting and ending character index within the string S of each match. ... But with some constraints.

The following constraints and peculiarities apply to this problem:

Working this out manually just by looking at the string and array, in this example, the solution would be the following, given in the correct descending order (zero-based indexing):

  1. The range of characters [20..34] ("jumped over the") is in index 1 of the array. Length = 15
  2. The range of characters [10..18] ("brown fox") is in index 0 of the array. Length = 9
  3. The range of characters [36..43] ("lazy dog") is in index 2 of the array. Length = 8
  4. The range of characters [49..55] ("ate pie") is in index 9 of the array (arbitrary match; matching "and ate" is equally valid, but we don't match both because "ate" is already "consumed"; no pun intended). Length = 7
  5. The range of characters [0..2] ("the") is in index 4 of the array. Length = 3
  6. The word "quick" was not matched to any element in the array.
  7. The word "and" was not matched to any element in the array.

Note, specifically, that "ox jumped over the laz", although it is the longest substring in A that is within S, is not matched in the result set because it violates the word boundaries of "fox" and "lazy".

Question: Am I describing a fairly standard algorithm that may exist in a library (in part or in whole; I am willing to build this out of simpler primitive building blocks) or is this something so custom that I need to implement it from scratch?

If I implement it from scratch, I think I need to take an approach broadly sketched out like the following:

Sounds slow... And probably moderately difficult to do right.

Upvotes: 6

Views: 654

Answers (1)

Frankie
Frankie

Reputation: 25165

You can easily do that resorting to regexes alone. While the following is demonstrative and does not comply with the extensive list of requests (namely putting the results in an array and ordering them) that's straightforward to implement.

The "tricky" part would be the word-boundary delimiter \b and using groups () to capture the actual group you want to want to match.

String[] A = {"brown fox", "jumped over the", "lazy dog", "dog", "the", "fish", "quantum burrito", "ox jumped over the laz", "and ate", "ate pie"};
String S = "the quick brown fox jumped over the lazy dog and ate pie";

for(String s : A) {
    Pattern p = Pattern.compile(".*\\b(" +s+ ")\\b.*");
    Matcher m = p.matcher(S);

    while (m.find()) {
        System.out.println(m.matches() + " => " + s);
        System.out.println("    Start index: " + m.start(1));
        System.out.println("    End index: " + m.end(1));
        System.out.println("    Length: " + m.group(1).length());
    }
}

The above matches all contained strings as long as they are space delimited and outputs their start/end position within the main string.

Upvotes: 1

Related Questions