SPWorley

Reputation: 11878

Data structure for finding nearby keys with similar bitvalues

I have some data, somewhere between a million and a billion records, each represented by a bitfield of about 64 bits per key. The bits are independent; you can think of them as essentially random.

If I have a test key and I want to find all values in my data with the same key, a hash table will spit those out very easily, in O(1).

What algorithm/data structure would efficiently find all records most similar to the query key? Here similar means that most bits are identical, but a small number are allowed to differ. This is traditionally measured by Hamming distance, which just counts the number of mismatched bits.

There are two ways this query might be made: by specifying a mismatch threshold, like "give me a list of all existing keys which have fewer than 6 bits that differ from my query", or by asking for the best matches, like "give me a list of the 10,000 keys with the lowest number of differing bits from my query."

You might be tempted to run to k-nearest-neighbor algorithms, but here we're talking about independent bits, so it doesn't seem likely that structures like quadtrees are useful.

The problem can be solved by simple brute force for small numbers of differing bits, probing a hash table. If we want to find all keys that differ by one bit from our query, for example, we can enumerate all 64 possible single-bit variants and test them all. But this explodes quickly: if we wanted to allow two bits of difference, we'd have to probe 64*63/2 = 2016 more times, and it gets combinatorially worse for higher numbers of bits.
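
To make the brute-force approach concrete, here is a minimal sketch in Python (the set of keys and the 64-bit width are just for illustration):

    from itertools import combinations

    def probe_within_distance(query, keys, max_dist, width=64):
        """Brute force: enumerate every key within max_dist bit flips of
        query and report the ones present in keys (a set of ints)."""
        matches = [query] if query in keys else []
        for d in range(1, max_dist + 1):
            for bits in combinations(range(width), d):  # C(64, d) variants at distance d
                candidate = query
                for b in bits:
                    candidate ^= 1 << b                 # flip the chosen bits
                if candidate in keys:
                    matches.append(candidate)
        return matches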

So is there another data structure or strategy that makes this kind of query more efficient? The database/structure can be preprocessed as much as you like, it's the query speed that should be optimized.

Upvotes: 10

Views: 802

Answers (13)

bayer

Reputation: 6894

If you are okay with a randomized algorithm (Monte Carlo in this case), you can use MinHash.
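
MinHash estimates the Jaccard similarity of the sets of set-bit positions rather than the Hamming distance directly, so it's only a proxy here. A minimal sketch of the signatures (the number of permutations and any bucketing scheme are left as choices):

    import random

    def make_perms(n=16, width=64, seed=0):
        """n random permutations of the bit positions 0..width-1."""
        rng = random.Random(seed)
        perms = []
        for _ in range(n):
            p = list(range(width))
            rng.shuffle(p)
            perms.append(p)
        return perms

    def minhash_signature(key, perms, width=64):
        """MinHash signature of the set of set-bit positions of key.
        Keys whose bit sets have high Jaccard similarity agree on many
        signature components, so they can be bucketed together."""
        positions = [i for i in range(width) if (key >> i) & 1]
        if not positions:                       # all-zero key has no set bits
            return (width,) * len(perms)
        return tuple(min(p[i] for i in positions) for p in perms)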

Upvotes: 0

Alex Brown

Reputation: 42872

Assuming you have to visit each row to test its value (or if you index on the bitfield then each index entry), then you can write the actual test quite efficiently using

A xor B

To find the difference bits, then bit-count the result (for example with a population-count instruction or one of the standard bit-twiddling tricks).

This effectively gives you the Hamming distance.

Since this can compile down to tens of instructions per test, this can run pretty fast.
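
A minimal sketch of that test (Python for brevity; a compiled implementation would use a hardware popcount instruction):

    def hamming(a, b):
        """Hamming distance between two 64-bit keys: XOR, then bit-count."""
        return (a ^ b).bit_count()    # Python 3.10+; otherwise bin(a ^ b).count("1")

    def scan(keys, query, max_dist):
        """Linear scan: keep every key within max_dist bits of the query."""
        return [k for k in keys if hamming(query, k) <= max_dist]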

Upvotes: 0

bill

Reputation:

Upvotes: 0

ilya n.

Reputation: 18816

If you're ok with doing it probabilistically, I think there's a good way to solve question 2. I assume you have about 2^30 records and a cutoff, and you want to find all points within the cutoff distance from test.

One_Try()
    1. Generate randomly a 20-bit subset S of 64 bits
    2. Ask for a list of elements that agree with test on S (about 2^10 elements)
    3. Sort that list by Hamming distance from test 
    4. Discard the part of list after cutoff

You repeat One_Try as many times as you need while merging the lists. The more tries you have, the more points you find. For example, if x is within 5 bits, you'll find it in one try with about (2/3)^5 = 13% probability. Therefore, if you repeat 100 tries you find all but roughly a 10^-6 fraction of such x. Total time: 100*(1000*log 1000).

The main advantage of this is that you're able to output answers to question 2 as you proceed, since after the first few tries you'll certainly find everything within distance not more than 3 bits, etc.

If you have many computers, you give each of them several tries, since they are perfectly parallelizable: each computer saves some hash tables in advance.
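
A minimal sketch of this scheme, with the hash tables built in advance as described (the table count and subset size follow the numbers above; everything else is illustrative):

    import random
    from collections import defaultdict

    def hamming(a, b):
        return (a ^ b).bit_count()

    def build_tables(keys, num_tries=100, subset_size=20, width=64, seed=0):
        """One hash table per try, keyed by the key's value on a random
        20-bit subset of the 64 bit positions."""
        rng = random.Random(seed)
        tables = []
        for _ in range(num_tries):
            mask = 0
            for b in rng.sample(range(width), subset_size):
                mask |= 1 << b
            table = defaultdict(list)
            for k in keys:
                table[k & mask].append(k)
            tables.append((mask, table))
        return tables

    def query(tables, test, cutoff):
        """Merge the per-try candidate lists, keeping keys within cutoff."""
        found = set()
        for mask, table in tables:
            for k in table.get(test & mask, []):
                if hamming(k, test) <= cutoff:
                    found.add(k)
        return found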

Upvotes: 0

Ray

Reputation: 3109

The database/structure can be preprocessed as much as you like

Well... if that is true, then all you need is a similarity matrix of your Hamming distances. Make the matrix sparse by pruning out large distances. It doesn't get any faster, and it's not that much of a memory hog.

Upvotes: 1

Nick Johnson

Reputation: 101139

What you want is a BK-Tree. It's a tree that's ideally suited to indexing metric spaces (your problem is one), and supports both nearest-neighbour and distance queries. I wrote an article about it a while ago.

BK-Trees are generally described with reference to text, using Levenshtein distance to build the tree, but it's straightforward to write one in terms of binary strings and Hamming distance.
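
A minimal BK-tree sketch over 64-bit keys with Hamming distance as the metric (not the code from the linked article, just an illustration):

    def hamming(a, b):
        return (a ^ b).bit_count()

    class BKTree:
        """BK-tree: each node stores a key plus children indexed by their
        distance from that key; the triangle inequality prunes the search."""
        def __init__(self):
            self.root = None                  # (key, {distance: child})

        def add(self, key):
            if self.root is None:
                self.root = (key, {})
                return
            node = self.root
            while True:
                d = hamming(key, node[0])
                if d == 0:
                    return                    # key already present
                if d not in node[1]:
                    node[1][d] = (key, {})
                    return
                node = node[1][d]

        def within(self, query, radius):
            """All stored keys within radius bits of query."""
            results, stack = [], ([self.root] if self.root else [])
            while stack:
                key, children = stack.pop()
                d = hamming(query, key)
                if d <= radius:
                    results.append(key)
                # only children whose edge distance lies in
                # [d - radius, d + radius] can contain matches
                for edge, child in children.items():
                    if d - radius <= edge <= d + radius:
                        stack.append(child)
            return results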

Upvotes: 6

David Johnstone

Reputation: 24450

I haven't completely thought this through, but I have an idea of where I'd start.

You could divide the search space up into a number of buckets where each bucket has a bucket key and the keys in the bucket are the keys that are more similar to this bucket key than any other bucket key. To create the bucket keys, you could randomly generate 64 bit keys and discard any that are too close to any previously created bucket key, or you could work out some algorithm that generates keys that are all dissimilar enough. To find the closest key to a test key, first find the bucket key that is closest, and then test each key in the bucket. (Actually, it's possible, but not likely, for the closest key to be in another bucket - do you need to find the closest key, or would a very close key be good enough?)
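
A minimal sketch of the idea, assuming purely random bucket keys (it skips the "discard bucket keys that are too similar" step and only searches the single nearest bucket, so it is approximate):

    import random

    def hamming(a, b):
        return (a ^ b).bit_count()

    def make_buckets(keys, num_buckets=1024, seed=0):
        """Assign each key to the nearest of num_buckets random bucket keys."""
        rng = random.Random(seed)
        centers = [rng.getrandbits(64) for _ in range(num_buckets)]
        buckets = {c: [] for c in centers}
        for k in keys:
            buckets[min(centers, key=lambda c: hamming(c, k))].append(k)
        return centers, buckets

    def closest_key(centers, buckets, query):
        """Search only the bucket whose bucket key is nearest the query."""
        nearest = min(centers, key=lambda c: hamming(c, query))
        return min(buckets[nearest], key=lambda k: hamming(k, query), default=None)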

Upvotes: 0

PeterAllenWebb

Reputation: 10408

Create a binary tree (specifically a trie) representing each key in your start set in the following way: The root node is the empty word, moving down the tree to the left appends a 0 and moving down the right appends a 1. The tree will only have as many leaves as your start set has elements, so the size should stay manageable.

Now you can do a recursive traversal of this tree, allowing at most n "deviations" from the query key in each recursive line of execution, until you have found all of the nodes in the start set which are within that number of deviations.
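
A minimal sketch of that traversal (the node layout is just one way to represent the trie):

    def build_trie(keys, width=64):
        """Bitwise trie: each node is [child_for_0, child_for_1, stored_key]."""
        root = [None, None, None]
        for k in keys:
            node = root
            for i in range(width - 1, -1, -1):        # most significant bit first
                bit = (k >> i) & 1
                if node[bit] is None:
                    node[bit] = [None, None, None]
                node = node[bit]
            node[2] = k                               # leaf remembers the full key
        return root

    def search(node, query, max_dev, width=64, depth=0):
        """All keys reachable with at most max_dev bit deviations from query."""
        if node is None or max_dev < 0:
            return []
        if depth == width:
            return [node[2]]
        bit = (query >> (width - 1 - depth)) & 1
        out = search(node[bit], query, max_dev, width, depth + 1)           # follow the query bit
        out += search(node[1 - bit], query, max_dev - 1, width, depth + 1)  # spend one deviation
        return out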

Upvotes: 3

Harper Shelby

Reputation: 16585

If the data weren't so sparse, a graph with keys as the vertices and edges linking 'adjacent' (Hamming distance = 1) nodes would probably be very efficient time-wise. The space would be very large though, so in your case, I don't think it would be a worthwhile tradeoff.

Upvotes: -1

Darius Bacon

Reputation: 15124

"Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions", from 2008, seems to be the best result as of then. I won't try to summarize since I read it over a year ago and it's hairy. That's from a page on locality-sensitive hashing, along with an implementation of an earlier version of the scheme. For more general pointers, read up on nearest neighbor search.

This kind of question has been asked before: Fastest way to find most similar string to an input?

Upvotes: 1

Apocalisp

Reputation: 35054

This sounds like a good fit for an S-Tree, which is like a hierarchical inverted file. Good resources on this topic include the following papers:

Hierarchical Bitmap Index: An Efficient and Scalable Indexing Technique for Set-Valued Attributes.

Improved Methods for Signature-Tree Construction (2000)

Quote from the first one:

The hierarchical bitmap index efficiently supports different classes of queries, including subset, superset and similarity queries. Our experiments show that the hierarchical bitmap index outperforms other set indexing techniques significantly.

These papers include references to other research that you might find useful, such as M-Trees.

Upvotes: 3

Jay Kominek

Reputation: 8783

I'd go with an inverted index, like a search engine. You've basically got a fixed vocabulary of 64 words. Then similarity is measured by hamming distance, instead of cosine similarity like a search engine would want to use. Constructing the index will be slow, but you ought to be able to query it with normal search enginey speeds.

The book Introduction to Information Retrieval covers the efficient construction, storage, compression and querying of inverted indexes.
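
A minimal sketch of the idea, treating each (bit position, bit value) pair as a vocabulary term, so a candidate's score against the query is 64 minus the Hamming distance (a real engine would compress and prune the posting lists rather than scan them all):

    from collections import defaultdict, Counter

    def build_index(keys, width=64):
        """Inverted index: one posting list per (bit position, bit value) term."""
        postings = defaultdict(list)
        for k in keys:
            for i in range(width):
                postings[(i, (k >> i) & 1)].append(k)
        return postings

    def top_matches(postings, query, n, width=64):
        """Score candidates by how many of the query's 64 terms they share;
        width - score is the Hamming distance."""
        scores = Counter()
        for i in range(width):
            for k in postings.get((i, (query >> i) & 1), []):
                scores[k] += 1
        return [(k, width - s) for k, s in scores.most_common(n)]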

Upvotes: 1

Martin Hock

Reputation: 964

Well, you could insert all of the neighbor keys along with the original key. That would mean that you store (64 choose k) times as much data for k differing bits, and it requires that you decide k beforehand. Though you could always extend the reach by brute-force querying neighbors of the query, which will automatically pick up the neighbors-of-neighbors that you inserted. This also gives you a time-space tradeoff: for example, if you accept a 64x data blowup and 64x slower queries, you can get two bits of distance.
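
A minimal sketch of the neighbor-insertion scheme for a chosen k (illustrative names; with up to a billion records the storage blowup is the point of the tradeoff):

    from itertools import combinations
    from collections import defaultdict

    def build(keys, k=1, width=64):
        """Map every key, and every variant within k bit flips of it, back to
        the original key, so one exact lookup covers distance <= k."""
        table = defaultdict(set)
        for key in keys:
            table[key].add(key)
            for d in range(1, k + 1):
                for bits in combinations(range(width), d):
                    variant = key
                    for b in bits:
                        variant ^= 1 << b
                    table[variant].add(key)
        return table

    # A single probe of table[query] now finds everything within k bits;
    # probing the 64 single-bit neighbors of the query as well extends
    # the reach to k + 1, at 64x the query cost.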

Upvotes: 0
