I am looking to pre-filter search results either using a bloom filter or a bitmask. To give a concrete example:
id,product,description
1,"coke", "A popular soft drink since 1900"
2,"pepsi", "A popular soda, similar to coke"
3,"soda", "A word to describe various soft drinks"
And if the user searched for the word "coke", we would match on product = "coke" for row 1, and on the description containing the word "coke" for row 2.
We have memory constraints, so we cannot index too many items, but I was thinking of implementing a bitmask based upon the first letter of each word a row contains. That way, we can see that the letter "c" appears in rows 1 and 2 but not in row 3, so we wouldn't include row 3 in our search at all.
If we took the first three rows, the "word-starts-with" mask would look like this (for the first four letters of the alphabet):
a b c d
1 0 1 1 (row 1 -- coke) -- has c? Y
1 0 1 0 (row 2 -- pepsi) -- has c? Y
1 0 0 1 (row 3 -- soda) -- has c? NO -- SKIP
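A minimal sketch of building and using such a word-starts-with mask in Python (the row data is the example above; the function names are illustrative, not from any library):

```python
import re

rows = [
    (1, "coke", "A popular soft drink since 1900"),
    (2, "pepsi", "A popular soda, similar to coke"),
    (3, "soda", "A word to describe various soft drinks"),
]

def first_letter_mask(*texts):
    """26-bit mask: bit i is set if any word starts with chr(ord('a') + i)."""
    mask = 0
    for text in texts:
        for word in re.findall(r"[a-z]+", text.lower()):
            mask |= 1 << (ord(word[0]) - ord("a"))
    return mask

masks = {row_id: first_letter_mask(product, desc)
         for row_id, product, desc in rows}

def candidate_rows(term):
    """Skip rows whose mask lacks the search term's first letter."""
    bit = 1 << (ord(term[0].lower()) - ord("a"))
    return [row_id for row_id, mask in masks.items() if mask & bit]
```

Note that this only prunes rows; surviving candidates still need a real check. For example, a search for "soda" keeps row 1 as a candidate because of "soft" and "since", even though row 1 never contains "soda".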
My question then is two-fold: is this bitmask approach reasonable, and should it stop at one character only?

A few more details about the search requirements:
The query is either description LIKE '%drink%' (full inner search) or description REGEXP '(^|\s)drink' ("edge search", matching at the start of any word).

Upvotes: 1
Views: 711
Reputation: 50127
Your bitmasks are simple Bloom filters: assuming you care about 26 possible characters, that is a Bloom filter with m = 26 * rowCount, k = 1, and the hash function hash(firstLetter, rowId) = firstLetter * rowCount + rowId. This is simple to implement, but likely not very efficient, as some letters appear very often (e.g. the character 'e'). Your bitmask needs about 4 bytes per row, which might be OK. For each row, you do one Bloom filter lookup.
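To make the Bloom-filter framing concrete, here is a sketch of that exact filter (the class name is mine; the bit-index formula is the hash function above):

```python
class FirstLetterBloom:
    """One k = 1 Bloom filter over all rows: m = 26 * row_count bits.
    Bit index = firstLetterIndex * row_count + row_id, as described above."""

    def __init__(self, row_count):
        self.row_count = row_count
        self.bits = bytearray((26 * row_count + 7) // 8)  # ~4 bytes per row

    def _index(self, letter, row_id):
        return (ord(letter) - ord("a")) * self.row_count + row_id

    def add(self, letter, row_id):
        i = self._index(letter, row_id)
        self.bits[i // 8] |= 1 << (i % 8)

    def might_contain(self, letter, row_id):
        i = self._index(letter, row_id)
        return bool(self.bits[i // 8] & (1 << (i % 8)))
```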
It is likely better to use a more sophisticated Bloom filter. What it looks like exactly depends on the data you have. You could keep m = 26 * rowCount and k = 1, but use hash(firstLetter, secondLetter, rowId) = ((11 * firstLetter + 113 * secondLetter) mod 26) * rowCount + rowId. That way, it uses the same space, but the bits are more evenly distributed. For frequent letters, that speeds up the search a lot, at the cost of slightly slower search for less frequent letters.
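A sketch of that mixing hash (the constants 11 and 113 are the ones from the formula above; keeping 26 buckets means the bit array stays the same size):

```python
def pair_bucket(first_letter, second_letter):
    """Map a letter pair to one of 26 buckets, spreading bits more
    evenly than indexing by the first letter alone."""
    a = ord(first_letter) - ord("a")
    b = ord(second_letter) - ord("a")
    return (11 * a + 113 * b) % 26

def bit_index(first_letter, second_letter, row_id, row_count):
    """Position in the same m = 26 * rowCount bit array as before."""
    return pair_bucket(first_letter, second_letter) * row_count + row_id
```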
Even better might be to combine multiple rows, say 8 rows each (rows 0..7, 8..15, ...), and then set all the bits in the Bloom filter that are needed. That way, you can reduce the number of lookups greatly.
If your queries can be of the form LIKE "%drink%", then a filter that only looks at first characters is not useful: you would still have to do a full scan. Instead, you could have a Bloom filter that combines (say) 8 rows and sets the bits of each character pair. For "drink", the pairs are ['dr', 'ri', 'in', 'nk']. Use m = 26 * rowCount / 8, k = 1, and hash(firstLetter, secondLetter, rowGroup) = ((11 * firstLetter + 113 * secondLetter) mod 26) * rowCount / 8 + rowGroup, with rowGroup = rowId / 8. So you basically check whether a character pair appears in a certain row group. That way, you can use the Bloom filter even for "like" conditions and regular expressions.
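A sketch of this row-group pair filter, under the assumptions above (groups of 8 rows, k = 1, the same mixing hash; class and method names are mine):

```python
class BigramGroupBloom:
    """Row-group pair filter: m = 26 * numberOfGroups bits, k = 1.
    Each adjacent character pair in a row sets one bit in the row's group."""

    GROUP = 8

    def __init__(self, row_count):
        self.groups = (row_count + self.GROUP - 1) // self.GROUP
        self.bits = bytearray((26 * self.groups + 7) // 8)

    def _index(self, a, b, group):
        # ((11 * first + 113 * second) mod 26) * numberOfGroups + group
        bucket = (11 * (ord(a) - ord("a")) + 113 * (ord(b) - ord("a"))) % 26
        return bucket * self.groups + group

    def _get(self, i):
        return (self.bits[i // 8] >> (i % 8)) & 1

    def add_row(self, row_id, text):
        """Set one bit per adjacent character pair within each word."""
        group = row_id // self.GROUP
        cleaned = "".join(c if c.isalpha() else " " for c in text.lower())
        for word in cleaned.split():
            for a, b in zip(word, word[1:]):
                i = self._index(a, b, group)
                self.bits[i // 8] |= 1 << (i % 8)

    def candidate_groups(self, term):
        """Groups that may contain the term; all other groups need no scan."""
        term = term.lower()
        pairs = list(zip(term, term[1:]))
        return [g for g in range(self.groups)
                if all(self._get(self._index(a, b, g)) for a, b in pairs)]
```

A query for "%drink%" then only scans the row groups returned by candidate_groups("drink"); as with any Bloom filter, a returned group may still turn out to contain no match.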
Upvotes: 0
Reputation: 7063
If you cannot tolerate false positives, you must not use a Bloom filter, as it is a probabilistic data structure.
The bitmask approach, apparently, is not time-efficient, and it would be difficult to scale later. Since you are talking about a data size of around 800 MB, you are entering the paradigm of Search, or Information Retrieval. The question no longer remains confined to 'Bitmasks vs Bloom Filters'. Have a read of the indexing-related concepts in Search Engine Indexing; they might help you.
To deal with the common words, please read about what stop words are and how to remove them. To go one level further, if you don't need to find the exact word, read about Stemming and Lemmatization.
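As a minimal illustration of stop-word removal (the stop list below is a toy example; real systems use curated lists such as those shipped with Lucene or NLTK):

```python
# Toy stop list -- production systems use curated, language-specific lists.
STOP_WORDS = {"a", "an", "the", "to", "of", "and", "in", "on", "is"}

def index_terms(text):
    """Lowercase, split on whitespace, and drop stop words before indexing."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]
```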
The question is quite broad, so I have just given a few pointers to read. I hope you find them useful.
Upvotes: 3