Pair strings based on unique substrings

I have two sets of strings that I need to match, if possible, by identical substrings in each pair (bold text in examples below; bold/capitalization only done here for emphasis there's not a way to identify the key substring by looking at a list element on its own) that are unique within each list. The remaining portion (lorem ipsum) of the text may be common to many elements or may be completely unique.

List one:

  1. "Lorem ipsum dolor sit amet, CANDY BAR consectetur adipisicing elit,"
  2. "sed do eiusmod CANDY CANE tempor incididunt ut labore et dolore magna"
  3. "sed do eiusmod tempor HOMER incididunt ut labore et dolore magna"
  4. "Lorem ipsum dolor sit amet, consectetur adipisicing PICKUP TRUCK elit,"
  5. "ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute"

List two:

  1. "sed do eiusmod tempor incididunt HOMER ut labore et dolore magna"
  2. "aliqua. Ut enim ad minim veniam, CANDY BAR quis nostrud exercitation"
  3. "aliqua. Ut enim ad minim veniam, quis nostrud CANDY CANE exercitation"
  4. "irure dolor in reprehenderit in voluptate velit esse cillum dolore"
  5. "Lorem ipsum dolor sit amet, consectetur adipisicing PICKUP TRUCK elit,"

From the sample text below matches are: 1-2; 2-3; 3-1; 4-5

element 5 in list one and element 4 in list 2 don't match with anything.

Upvotes: 0

Views: 230

Answers (2)

jogojapan
jogojapan

Reputation: 69967

If the total amount of data you are dealing with is relatively small, the solutions already suggested (using .contains() or regular expressions) are probably most practical. The below is a way of doing it when the amount of data is much larger.

The key part of the solution is to use suffix arrays. A suffix array is a lexicographically sorted list of all suffixes (in the sense of string ends, not linguistic suffixes) of a text (or a concatenation of multiple texts).

In the example you've described, this would involve constructing a suffix array of the concatenated texts of one of the two sets only. I'll assume we do this for set 2, so we would concatenate all sentences, using a unique separation character (I have chosen the hash character # below):

 sed do eiusmod tempor incididunt HOMER ut labore et dolore magna#aliqua. Ut enim ad minim veniam, CANDY BAR quis nostrud exercitation#aliqua. Ut enim ad minim veniam, quis nostrud CANDY CANE exercitation#....

Next, you'd construct the suffix array of that string, along with the longest-common prefix array (LCP). Both data structures can be constructed using a brute force approach if the amount of text is not extremely large. Alternatively, there are libraries for building them more efficiently, for example jSuffixArrays.

Finally you iterate through the sentences of set 1, and in each sentence through the candidate starting positions of relevant tokens (probably only words that follow white space or punctuation) and search the suffix array of set 2 for them. Searching a suffix array when the LCP array is available can be done in O(n+m) time (n is the length of the concatenated string of set 2, m is the length of the candidate string you are looking for) using the classical search algorithm by Manber and Myers, but if that is still too slow, there are refined methods available, e.g. described by Navarro and Mäkinen 2007.

For each match you find, the suffix array can readily provide with information on how frequent the string occurs in set 2, and in how many different sentences. I can elaborate on how to do the latter in an edit to this post if needed.

Upvotes: 2

aretai
aretai

Reputation: 1641

As I understand you have a list of strings that are unique within each list. These strings are then part (substrings) of the Strings in the lists. I would create a list of such substrings and then compare them using regex (rather no java substring as in this case you would need to know start index).

Upvotes: 1

Related Questions