Reputation: 436
Currently I work on an application where I have large number of hash values (strings).
When a query hash value (string) is given, the search process goes through those strings and return strings where the Hamming Distance between the query string and the result string is less than a given threshold.
1000302014771944008
"t>25
) and can be vary.I want to implement this search process using an efficient algorithm rather than using brute-force approach.
I have read some research papers (like this & this), but they are for binary strings or for low threshold values. I also tried Locality-sensitive hashing, but implementations I found were focused on binary strings.
Are there any algorithms or data structures to address this problem?
Any suggestions are also welcome. Thank you in advance.
.
Additional Information
Hamming Distance between non-binary strings
string 1: 0014479902266110001131133
string 2: 0014409902226110001111133
-------------------------
1 1 1 = 3 <-- hamming distance
Considered brute-force approach
Upvotes: 0
Views: 1319
Reputation: 115
Read the 7th section of the paper:
"HmSearch: An Efficient Hamming Distance Query Processing Algorithm".
The state-of-art result for d-query problem can be found at:
"Dictionary matching and indexing with errors and don’t care", which solves d-query problem in time O(m+log(nm)^d+occ) using space O(n*log(nm)^d), where occ is the number of query result.
If threshold values is not smal, there are practical solutions for binary strings, found on HmSearch.
I think it is possible to apply the same practical solutions found on HmSearch for arbitrary strings, but I've never seen those solutions.
Upvotes: 1
Reputation: 1221
Something like this could work for you.
http://blog.mafr.de/2011/01/06/near-duplicate-detection/
Upvotes: 0