I am looking to pre-filter search results either using a bloom filter or a bitmask. To give a concrete example:
id,product,description
1,"coke", "A popular soft drink since 1900"
2,"pepsi", "A popular soda, similar to coke"
3,"soda", "A word to describe various soft drinks"
And if the user searched for the word "coke", we would match on product = "coke" for row 1, and on the description containing the word "coke" for row 2.
We have memory constraints, so we cannot index too many items, but I was thinking of implementing a bitmask based upon the first letter of each word a row contains. That way, we can see that the letter "c" appears in rows 1 and 2 but not in row 3, so we wouldn't include row 3 in our search at all.
If we took the first three rows, the "word-starts-with" mask would look like this (for the first four letters of the alphabet):
a b c d
1 0 1 1 (row 1 -- coke) -- has c? Y
1 0 1 0 (row 2 -- pepsi) -- has c? Y
1 0 0 1 (row 3 -- soda) -- has c? NO -- SKIP
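A minimal sketch of building and using such a word-starts-with mask in Python (the row data is the example above; the function names are illustrative, not from any library):

```python
import re

rows = [
    (1, "coke", "A popular soft drink since 1900"),
    (2, "pepsi", "A popular soda, similar to coke"),
    (3, "soda", "A word to describe various soft drinks"),
]

def first_letter_mask(*texts):
    """26-bit mask: bit i is set if any word starts with chr(ord('a') + i)."""
    mask = 0
    for text in texts:
        for word in re.findall(r"[a-z]+", text.lower()):
            mask |= 1 << (ord(word[0]) - ord("a"))
    return mask

masks = {row_id: first_letter_mask(product, desc)
         for row_id, product, desc in rows}

def candidate_rows(term):
    """Skip rows whose mask lacks the search term's first letter."""
    bit = 1 << (ord(term[0].lower()) - ord("a"))
    return [row_id for row_id, mask in masks.items() if mask & bit]
```

Note that this only prunes rows; surviving candidates still need a real check. For example, a search for "soda" keeps row 1 as a candidate because of "soft" and "since", even though row 1 never contains "soda".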
My question then is two-fold: is this bitmask approach reasonable, and should it stop at one character only?

A few more details about the search requirements:
The query is either description LIKE '%drink%' (full inner search) or description REGEXP '(^|\s)drink' ("edge search", matching at the start of any word).

Upvotes: 1
Views: 711
Reputation: 50127
Your bitmasks are simple Bloom filters: assuming you care about 26 possible characters, that is a Bloom filter with m = 26 * rowCount, k = 1, and the hash function hash(firstLetter, rowId) = firstLetter * rowCount + rowId. This is simple to implement, but likely not very efficient, as some letters appear very often (e.g. the character 'e'). Your bitmask needs about 4 bytes per row, which might be OK. For each row, you do one Bloom filter lookup.
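To make the Bloom-filter framing concrete, here is a sketch of that exact filter (the class name is mine; the bit-index formula is the hash function above):

```python
class FirstLetterBloom:
    """One k = 1 Bloom filter over all rows: m = 26 * row_count bits.
    Bit index = firstLetterIndex * row_count + row_id, as described above."""

    def __init__(self, row_count):
        self.row_count = row_count
        self.bits = bytearray((26 * row_count + 7) // 8)  # ~4 bytes per row

    def _index(self, letter, row_id):
        return (ord(letter) - ord("a")) * self.row_count + row_id

    def add(self, letter, row_id):
        i = self._index(letter, row_id)
        self.bits[i // 8] |= 1 << (i % 8)

    def might_contain(self, letter, row_id):
        i = self._index(letter, row_id)
        return bool(self.bits[i // 8] & (1 << (i % 8)))
```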
It is likely better to use a more sophisticated Bloom filter. What it looks like exactly depends on the data you have. You could keep m = 26 * rowCount and k = 1, but use hash(firstLetter, secondLetter, rowId) = ((11 * firstLetter + 113 * secondLetter) mod 26) * rowCount + rowId. That way, it uses the same space, but the bits are more evenly distributed. For frequent letters, that speeds up the search a lot, at the cost of slightly slower search for less frequent letters.
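A sketch of that mixing hash (the constants 11 and 113 are the ones from the formula above; keeping 26 buckets means the bit array stays the same size):

```python
def pair_bucket(first_letter, second_letter):
    """Map a letter pair to one of 26 buckets, spreading bits more
    evenly than indexing by the first letter alone."""
    a = ord(first_letter) - ord("a")
    b = ord(second_letter) - ord("a")
    return (11 * a + 113 * b) % 26

def bit_index(first_letter, second_letter, row_id, row_count):
    """Position in the same m = 26 * rowCount bit array as before."""
    return pair_bucket(first_letter, second_letter) * row_count + row_id
```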
Even better might be to combine multiple rows, say 8 rows each (rows 0..7, 8..15, ...), and then set all the bits in the Bloom filter that are needed. That way, you can reduce the number of lookups greatly.
If your queries can be of the form LIKE "%drink%", then a filter that only looks at first characters is not useful: you would still have to do a full scan. Instead, you could have a Bloom filter that combines (say) 8 rows and sets the bits of each character pair. For "drink", the pairs are ['dr', 'ri', 'in', 'nk']. Use m = 26 * rowCount / 8, k = 1, and hash(firstLetter, secondLetter, rowGroup) = ((11 * firstLetter + 113 * secondLetter) mod 26) * rowCount / 8 + rowGroup, with rowGroup = rowId / 8. So you basically check whether a character pair appears in a certain row group. That way, you can use the Bloom filter even for "like" conditions and regular expressions.
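A sketch of this row-group pair filter, under the assumptions above (groups of 8 rows, k = 1, the same mixing hash; class and method names are mine):

```python
class BigramGroupBloom:
    """Row-group pair filter: m = 26 * numberOfGroups bits, k = 1.
    Each adjacent character pair in a row sets one bit in the row's group."""

    GROUP = 8

    def __init__(self, row_count):
        self.groups = (row_count + self.GROUP - 1) // self.GROUP
        self.bits = bytearray((26 * self.groups + 7) // 8)

    def _index(self, a, b, group):
        # ((11 * first + 113 * second) mod 26) * numberOfGroups + group
        bucket = (11 * (ord(a) - ord("a")) + 113 * (ord(b) - ord("a"))) % 26
        return bucket * self.groups + group

    def _get(self, i):
        return (self.bits[i // 8] >> (i % 8)) & 1

    def add_row(self, row_id, text):
        """Set one bit per adjacent character pair within each word."""
        group = row_id // self.GROUP
        cleaned = "".join(c if c.isalpha() else " " for c in text.lower())
        for word in cleaned.split():
            for a, b in zip(word, word[1:]):
                i = self._index(a, b, group)
                self.bits[i // 8] |= 1 << (i % 8)

    def candidate_groups(self, term):
        """Groups that may contain the term; all other groups need no scan."""
        term = term.lower()
        pairs = list(zip(term, term[1:]))
        return [g for g in range(self.groups)
                if all(self._get(self._index(a, b, g)) for a, b in pairs)]
```

A query for "%drink%" then only scans the row groups returned by candidate_groups("drink"); as with any Bloom filter, a returned group may still turn out to contain no match.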
Upvotes: 0
Reputation: 7063
If you cannot tolerate false positives, you must not use a Bloom filter, as it is a probabilistic data structure.
The bitmask approach, apparently, is not time-efficient, and it would be difficult to scale later. Since you are talking about a data size of around 800 MB, you are entering the paradigm of Search, or Information Retrieval. The question no longer remains confined to 'Bitmasks vs Bloom Filters'. Have a read of the indexing-related concepts in Search Engine Indexing; they might help you.
To deal with the common words, please read about what stop words are and how to remove them. To go one level further, if you don't need to find the exact word, read about Stemming and Lemmatization.
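As a minimal illustration of stop-word removal (the stop list below is a toy example; real systems use curated lists such as those shipped with Lucene or NLTK):

```python
# Toy stop list -- production systems use curated, language-specific lists.
STOP_WORDS = {"a", "an", "the", "to", "of", "and", "in", "on", "is"}

def index_terms(text):
    """Lowercase, split on whitespace, and drop stop words before indexing."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]
```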
The question is quite broad, so I have just given a few pointers to read. I hope you find them useful.
Upvotes: 3