Reputation: 385
I have to execute this line of cose several million times, I wonder if there is a way to optimize it (maybe precomputing something?).
a.contains(b) || b.contains(a)
Thank you
edit: the code executed by the contains method already checks for a.length < b.length.
public static int indexOf(byte[] value, int valueCount, byte[] str, int strCount, int fromIndex) {
byte first = str[0];
int max = (valueCount - strCount);
for (int i = fromIndex; i <= max; i++) {
[...]
}
return -1;
}
Upvotes: 6
Views: 243
Reputation: 82929
As I understand the task, you have to check whether a
contains b
or vice versa for each pair of a
and b
from a set of about 35 million words. That's a lot of pairs to check.
You should be able to narrow the search down considerable by precomputing which n-grams a word contains: If a
contains some n-gram, then b
has to contain the same n-gram if b
contains a
. You could e.g. precompute all the trigrams that each word in the list contains, and at the same time all the words that contain a given trigram, then you can just look up the words in those dictionaries and with some set operations get a small set of candidates to check properly.
In pseudo-code:
Map<String, Set<String>> ngram_to_word
a
in your data set
a
a
to the sets of words containing those n-grams in ngrams_to_words
a
in your data set
a
containsngrams_to_words
b
in that intersection that contains all the n-grams that a
contains (but maybe in a different order or quantity), properly check whether b
contains a
Depending on the number of letters in those n-grams (e.g. bigrams, trigrams, ...), they will be more expensive to pre-compute, in both time and space, but the effect will also be greater. In the simplest case, you could even just precompute which words contain a given letter (i.e. "1-grams"); that should be fast and already considerable narrow down the words to check. Of course, the n-grams should not be shorter than the shortest of the words in the data set, but you could even use two length of n-grams, e.g. use two maps letter_to_words
and trigrams_to_words
.
Upvotes: 3