Reputation: 6061
I have two sets of strings that I need to match, if possible, by identical substrings in each pair (bold text in examples below; bold/capitalization only done here for emphasis there's not a way to identify the key substring by looking at a list element on its own) that are unique within each list. The remaining portion (lorem ipsum) of the text may be common to many elements or may be completely unique.
List one:
List two:
From the sample text below matches are: 1-2; 2-3; 3-1; 4-5
element 5 in list one and element 4 in list 2 don't match with anything.
Upvotes: 0
Views: 230
Reputation: 69967
If the total amount of data you are dealing with is relatively small, the solutions already suggested (using .contains()
or regular expressions) are probably most practical. The below is a way of doing it when the amount of data is much larger.
The key part of the solution is to use suffix arrays. A suffix array is a lexicographically sorted list of all suffixes (in the sense of string ends, not linguistic suffixes) of a text (or a concatenation of multiple texts).
In the example you've described, this would involve constructing a suffix array of the concatenated texts of one of the two sets only. I'll assume we do this for set 2, so we would concatenate all sentences, using a unique separation character (I have chosen the hash character #
below):
sed do eiusmod tempor incididunt HOMER ut labore et dolore magna#aliqua. Ut enim ad minim veniam, CANDY BAR quis nostrud exercitation#aliqua. Ut enim ad minim veniam, quis nostrud CANDY CANE exercitation#....
Next, you'd construct the suffix array of that string, along with the longest-common prefix array (LCP). Both data structures can be constructed using a brute force approach if the amount of text is not extremely large. Alternatively, there are libraries for building them more efficiently, for example jSuffixArrays.
Finally you iterate through the sentences of set 1, and in each sentence through the candidate starting positions of relevant tokens (probably only words that follow white space or punctuation) and search the suffix array of set 2 for them. Searching a suffix array when the LCP array is available can be done in O(n+m) time (n is the length of the concatenated string of set 2, m is the length of the candidate string you are looking for) using the classical search algorithm by Manber and Myers, but if that is still too slow, there are refined methods available, e.g. described by Navarro and Mäkinen 2007.
For each match you find, the suffix array can readily provide with information on how frequent the string occurs in set 2, and in how many different sentences. I can elaborate on how to do the latter in an edit to this post if needed.
Upvotes: 2
Reputation: 1641
As I understand you have a list of strings that are unique within each list. These strings are then part (substrings) of the Strings in the lists. I would create a list of such substrings and then compare them using regex (rather no java substring as in this case you would need to know start index).
Upvotes: 1