Reputation: 19468
I am starting with a collection C of target sets. Each set contains words (or strings without spaces). As I iterate over sentences, I can consider the sentence as an observed set of words D. My problem is that, for each sentence, I want to find all of the sets S in C such that D contains S. In other words, I want to find all sets of words in C for which all of their words are in the sentence.
For example, consider the following contents for C:
{fell, ate}
{cat, snail, tiger}
{tree, bush, jalapeno}
{pen, paperclip, stapler}
Now if I see the sentence "The tree fell over on the bush because it ate a jalapeno.", I would want to return the following two sets.
{fell, ate}
{tree, bush, jalapeno}
This seems somewhat similar to a trie, but only somewhat, because I am not matching on all the words in the sentence, but rather on any subset of them. One idea was to represent the collection C in a trie-like data structure with a Map[String, PseudoTrie] at each level, and follow multiple paths through it as I iterate over the words in the sentence in sorted order. A sketch of this idea appears below.
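Here is a minimal sketch of that trie idea, with my own hypothetical naming (PseudoTrie, build, search). Target sets are inserted with their words sorted, and the search keeps a frontier of nodes, letting each node either consume or skip each sentence word:

case class PseudoTrie(
  terminal: Option[Set[String]], // the target set ending at this node, if any
  children: Map[String, PseudoTrie]
) {
  def insert(words: List[String], s: Set[String]): PseudoTrie = words match {
    case Nil => copy(terminal = Some(s))
    case w :: rest =>
      val child = children.getOrElse(w, PseudoTrie(None, Map.empty))
      copy(children = children.updated(w, child.insert(rest, s)))
  }
}

def build(c: Seq[Set[String]]): PseudoTrie =
  c.foldLeft(PseudoTrie(None, Map.empty))((t, s) => t.insert(s.toList.sorted, s))

def search(t: PseudoTrie, d: Set[String]): List[Set[String]] = {
  // Process the sentence's words in sorted order; at each word, every active
  // node may either skip the word (stay) or consume it (descend to a child).
  val active = d.toList.sorted.foldLeft(List(t)) { (nodes, w) =>
    nodes ++ nodes.flatMap(_.children.get(w))
  }
  active.flatMap(_.terminal)
}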
I understand this is similar to a search query. However, it's fine if I need to iterate over all the sentences, so long as the computation for each sentence is fast. Iterating over each set in C and performing a contains check is too slow.
UPDATE
I thought people might be interested in some of the results I had. These timings are over 100,000 sentences, and each target set in C has about 4 or 5 words.
1. Original application. Iterate over all target sets and see if each is contained in the sentence, where the sentence is represented by a set of words. 227 s
2. Filtering by vocabulary. The same as the original process, except sentences are represented by the set of words in that sentence that also appear in some target set. The advantage is that we only need to consider a subset of the words in a sentence when doing the comparisons. 110 s
3. Strings are translated to integer keys starting from 0. This also includes the filtering from trial #2 by necessity. 86 s
4. Add an inverted index. 4 s
I also tried BitSets with an inverted index. It took 20 s, so I didn't stick with them. I'm not sure why it was slower; I may have done something wrong.
Also, I think my original idea is great (many layers of inverted indices), but it's rather complicated, and I have a pretty good speedup already!
Upvotes: 9
Views: 572
Reputation: 25042
An inverted index can be quite useful for this sort of problem. As a starting point, consider creating a mapping from the words in C to a list of all the sets containing that word, thus having type Map[String, List[Set[String]]]; this is the inverted index.
Using the inverted index, you can find the sets which are contained in D without ever checking those sets which have an empty intersection with D. Just iterate through the list of sets for each of the distinct words in D, keeping track of how many times each set is encountered. Compare the counts to the lengths of the sets; a set S is a subset of D if and only if the count for S equals the number of elements in S.
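A minimal Scala sketch of this counting approach (buildIndex and findSubsets are my own hypothetical names):

import scala.collection.mutable

// Build the inverted index: each word maps to the target sets containing it.
def buildIndex(c: Seq[Set[String]]): Map[String, List[Set[String]]] =
  c.foldLeft(Map.empty[String, List[Set[String]]]) { (index, s) =>
    s.foldLeft(index)((ix, w) => ix.updated(w, s :: ix.getOrElse(w, Nil)))
  }

// Count how often each candidate set is hit by the sentence's distinct words;
// a set is contained in the sentence iff every one of its words scores a hit.
def findSubsets(d: Set[String], index: Map[String, List[Set[String]]]): List[Set[String]] = {
  val counts = mutable.Map.empty[Set[String], Int].withDefaultValue(0)
  for (w <- d; s <- index.getOrElse(w, Nil)) counts(s) += 1
  counts.collect { case (s, n) if n == s.size => s }.toList
}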
This approach speeds up the search by eliminating checks of those sets which do not intersect D at all. You can extend the idea to eliminate even more checks by using an index from two-word sets to lists of the sets containing both words. Now, sets which have only one word in common with D will not be checked (so sets with just one word need to be treated separately!). It becomes necessary to iterate through all two-element subsets of D, comparing the counts to the number of two-element subsets of each set S, but otherwise the approach is the same.
Even larger subsets can be used as keys in the index, but at some point you'll be generating more possible keys than the number of operations that will be saved. The optimal choice will depend on the specifics of C and the set of sentences.
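For illustration, a sketch of the two-word-key variant under the same assumptions (singleton target sets would still need the one-word index):

// Index keyed by unordered word pairs, represented as two-element Sets.
def buildPairIndex(c: Seq[Set[String]]): Map[Set[String], List[Set[String]]] =
  c.filter(_.size >= 2).foldLeft(Map.empty[Set[String], List[Set[String]]]) { (index, s) =>
    s.subsets(2).foldLeft(index)((ix, pair) => ix.updated(pair, s :: ix.getOrElse(pair, Nil)))
  }

// A set S with k words has k * (k - 1) / 2 two-element subsets; S is contained
// in the sentence iff every one of those pairs is also a pair of sentence words.
def findSubsetsByPairs(d: Set[String], index: Map[Set[String], List[Set[String]]]): List[Set[String]] = {
  val counts = scala.collection.mutable.Map.empty[Set[String], Int].withDefaultValue(0)
  for (pair <- d.subsets(2); s <- index.getOrElse(pair, Nil)) counts(s) += 1
  counts.collect { case (s, n) if n == s.size * (s.size - 1) / 2 => s }.toList
}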
Upvotes: 2
Reputation: 19601
You can often speed up the underlying comparisons by creating a dictionary giving each word a number, and then switching from comparisons between strings to comparisons between numbers.
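For example (a sketch; c stands in for the collection of target sets):

// A toy collection of target sets.
val c = Seq(Set("fell", "ate"), Set("tree", "bush", "jalapeno"))

// Assign each distinct word a small integer id, then work with Set[Int]
// instead of Set[String]; sentence words are mapped through the same table.
val wordIds: Map[String, Int] = c.flatten.distinct.zipWithIndex.toMap
val numericTargets: Seq[Set[Int]] = c.map(_.map(wordIds))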
One easy approach would be to pick one word at random from each set and then create a dictionary mapping each word to a list of the sets for which it was the chosen word. Then, given a sentence, look up each of its words in the dictionary and check whether any of the sets in the resulting lists are contained in the sentence.
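A sketch of that dictionary, with my own hypothetical names (buildRepIndex, matches):

import scala.util.Random

// Map each set's randomly chosen representative word to the sets that chose it.
def buildRepIndex(c: Seq[Set[String]], rng: Random): Map[String, List[Set[String]]] =
  c.foldLeft(Map.empty[String, List[Set[String]]]) { (index, s) =>
    val rep = s.toVector(rng.nextInt(s.size))
    index.updated(rep, s :: index.getOrElse(rep, Nil))
  }

// Only sets whose representative word occurs in the sentence get a full check.
def matches(d: Set[String], index: Map[String, List[Set[String]]]): List[Set[String]] =
  d.toList.flatMap(w => index.getOrElse(w, Nil)).filter(_.subsetOf(d))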
You may be able to detect quickly when a set is not a subset of a sentence. Create a sparse 64-bit bit pattern for each word, and represent each set as the OR of the bit patterns of its words. Represent a sentence as the OR of the patterns of all its words. Then if set & ~sentence != 0, the set is not contained in the sentence. If a set fails this test, it is definitely not a subset. If it passes, unfortunately, it may still not be a subset, and you will have to use a slower test to confirm this; but if enough sets fail at the first hurdle, you may save time.

As a rule of thumb, I would make each 64-bit pattern have k randomly chosen bits set, choosing k such that the 64-bit patterns representing sentences have about half their bits set, though you can probably work out a better target on the back of an envelope with a little thought. If you get to this test only after finding a particular word in a sentence, you can of course leave that word out of the bit patterns you create for the sets, taking its presence for granted.
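A sketch of the signature test (my own naming; seeding a Random from the word's hash is just one assumed way to get a stable sparse pattern per word):

import scala.util.Random

// Give each word a sparse 64-bit signature with (up to) k random bits set.
def signature(word: String, k: Int): Long = {
  val rng = new Random(word.hashCode) // deterministic per word
  Iterator.continually(rng.nextInt(64)).take(k).foldLeft(0L)((b, i) => b | (1L << i))
}

// A set's (or sentence's) signature is the OR of its words' signatures.
def groupSignature(words: Set[String], k: Int): Long =
  words.foldLeft(0L)(_ | signature(_, k))

// Cheap pre-test: if the set has a bit the sentence lacks, it cannot be a
// subset. Passing is inconclusive, so fall back to a real subset check.
def maybeSubset(setSig: Long, sentenceSig: Long): Boolean =
  (setSig & ~sentenceSig) == 0L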
Suppose a word sets each bit in our 64-bit bitmap independently, and fails to set it with probability x. Then in a sentence of n words, a bit is not set in the bitmap for that sentence with probability x^n. For a set with k words, we can discard on the basis of a given bit if it is set by the set's words but not by the sentence, which happens with probability (1-x^k)x^n. Differentiating, this has a maximum at x = (n/(n+k))^(1/k). If I set n = 20 and k = 4, then I want x = 0.9554; bits are then clear in a sentence about 40% of the time, and a single bit discards about 7% of the time. But we have 64 bits, which are pretty much independent on this model, so we discard complete mismatches over 98% of the time.
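A quick numeric check of those figures:

val n = 20; val k = 4
val x = math.pow(n.toDouble / (n + k), 1.0 / k)              // ≈ 0.9554
val bitClearInSentence = math.pow(x, n)                      // ≈ 0.40
val singleBitDiscard = (1 - math.pow(x, k)) * math.pow(x, n) // ≈ 0.067
val fullDiscard = 1 - math.pow(1 - singleBitDiscard, 64)     // ≈ 0.988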
Upvotes: 1
Reputation: 139038
We'll start with the collection of target sets you want to search against:
val corpus = Seq(
  Set("fell", "ate"),
  Set("cat", "snail", "tiger"),
  Set("tree", "bush", "jalapeno"),
  Set("pen", "paperclip", "stapler")
)
One fairly efficient way of representing this is as a table of bits, with vocabulary items as the columns and target sets as the rows. We define a couple of functions for converting to this representation:
import scala.collection.immutable.BitSet

// Map each word in the combined vocabulary to a distinct bit index.
def vocabulary(c: Seq[Set[String]]) = c.reduce(_ union _).zipWithIndex.toMap

// Convert a set of words to a BitSet, ignoring out-of-vocabulary words.
def convert(s: Set[String], v: Map[String, Int]) = s.foldLeft(BitSet.empty) {
  (b, w) => v.get(w).map(b + _).getOrElse(b)
}
And a function to search a corpus c for all the target sets that a given sentence s contains:
// Keep exactly those sets whose bits are all present in the sentence.
def search(s: BitSet, c: Seq[BitSet]) = c.filter(x => (x & s) == x)
This is going to be pretty damn fast, since it's just a bitwise "and" and an equality comparison for each set in the corpus. We can test:
val vocab = vocabulary(corpus)
val table = corpus.map(convert(_, vocab))
val sentence = convert(
  "The tree fell over on the bush because it ate a jalapeno".split(" ").toSet,
  vocab
)
val result = search(sentence, table)
Which gives us List(BitSet(2, 6), BitSet(5, 7, 10)). To confirm that this is what we wanted:
val bacov = vocab.map(_.swap).toMap
result.map(_.map(bacov(_)))
This is List(Set(fell, ate), Set(jalapeno, tree, bush)), as desired.
Upvotes: 5