wehriam

Reputation: 204

Duplicate text detection / hashing

I have sets of strings in a database. Each set will have fewer than 500 members, there will be tens of thousands of sets, and the strings are natural language. I would like to detect duplicate strings within each set. New strings will be compared with an existing set, and added to the database if they are unique.

Are there hashing algorithms that would be effective at finding (very) similar strings? For example, the strings probably would have the same number of words, but the encoding may be slightly different (UTF-8 vs Latin-1).

Upvotes: 3

Views: 3895

Answers (7)

I believe Locality-Sensitive Hashing (LSH) is what you need: https://onestopdataanalysis.com/lsh/

It is a really fast technique for near-duplicate text search.

Upvotes: 0

user1417684

Reputation: 2604

This post on my blog may be of interest.

A description of the algorithm and a link to the code are provided. In short, it's an n-gram-based approach that makes no assumptions about the content or structure of the input, and generates constant-length signatures for all input documents.
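The linked post isn't reproduced here, but a minimal sketch of how a constant-length n-gram signature can work (the bucketing scheme below is my illustration, not necessarily the post's exact algorithm):

```python
import hashlib

def ngram_signature(text, n=3, size=64):
    """Hash each character n-gram into a fixed-size vector of counts,
    then binarize against the mean count, yielding a constant-length
    signature regardless of input length."""
    counts = [0] * size
    text = text.lower()
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        bucket = int(hashlib.md5(gram.encode("utf-8")).hexdigest(), 16) % size
        counts[bucket] += 1
    mean = sum(counts) / size
    return tuple(1 if c > mean else 0 for c in counts)
```

Signatures of near-duplicate strings differ in only a few bits, so comparing them by Hamming distance is cheap.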

Upvotes: 1

si28719e

Reputation: 2165

You could get crazy and try latent semantic analysis/mapping and the singular value decomposition (latent semantic mapping),

together with SVDLIBC, which is pretty easy to get going with.

Upvotes: 0

Van Gale

Reputation: 43922

This might be overkill, but you might want to try NLTK (Natural Language Toolkit), which is Python based.

One feature that might be useful is to analyze sentence structure. Of course that might lead to some strings being marked as duplicate because they have the same grammar structure, but different words and meaning.

You might also be able to use probability and classification features.

Upvotes: 0

tom10

Reputation: 69242

The short answer is to just guess at a hash parameter that matches your idea of "similar".

Probably something like the sum of all the letters (A) and the sum of the differences between adjacent letters (B) would work. For each new string, use its A and B values to quickly look up a now much smaller set of similar strings, and then do a more careful comparison between these.
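A sketch of that idea in Python (the bucket width and the exact choice of features A and B are guesses you would tune against your data):

```python
def coarse_key(s, bucket=16):
    """Two lossy features: A = sum of character codes, B = sum of
    absolute differences between adjacent character codes. Integer
    division into buckets tolerates tiny variations while keeping
    lookups cheap."""
    codes = [ord(c) for c in s]
    a = sum(codes)
    b = sum(abs(x - y) for x, y in zip(codes, codes[1:]))
    return (a // bucket, b // bucket)

# Index existing strings by their coarse key.
index = {}
for s in ["hello world", "hello worlds", "completely different text"]:
    index.setdefault(coarse_key(s), []).append(s)

# A new string is carefully compared only against index.get(coarse_key(new), [])
```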

This probably isn't the purest solution, but practically, lots of problems are solved this way. Beyond this, I think there is currently quite a bit of work solving similar problems in genetics (i.e. finding similar gene sequences within huge databases), but I don't think there's an accepted generic solution to this problem.

Upvotes: 1

Laurence Gonsalves

Reputation: 143294

For starters, you should probably do some sort of normalization: convert all of your text to a single encoding (e.g. UTF-8). You may also want to do case folding and other Unicode normalizations, and perhaps sort each set (depending on how you're storing them).
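A minimal normalization pass in Python, assuming the strings have already been decoded from their source encodings (UTF-8, Latin-1, etc.) into `str`:

```python
import unicodedata

def normalize(text):
    """Canonicalize a string: NFC Unicode normalization composes
    equivalent sequences (e.g. 'e' + combining acute -> 'é'), and
    casefold() handles case differences more aggressively than lower()."""
    return unicodedata.normalize("NFC", text).casefold()
```

After this, strings that differed only in encoding form or case compare equal, so an exact-match index already catches many "duplicates".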

It's unclear (to me) from your question whether you want to find exact matches or just string sets that are "similar". If you only care about exact matches once the normalization is taken into account, then you're pretty much done. Just have an index on the normalized forms of your string sets and you can look up new sets quickly by normalizing them as well.

If you want to find near matches then you'll probably want to do some sort of similarity hashing. The Wikipedia article on Locality Sensitive Hashing describes a number of techniques.

The basic idea behind a number of these techniques is to compute a handful of very lossy hashes on each string, h[0] through h[n]. To look up a new string set you'd compute its hashes and look each of these up. Anything that gets at least one match is "similar", and the more matches the more similar it is (and you can choose what threshold to cut things off at).
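A sketch of that scheme using MinHash over character shingles, one common choice of "very lossy hash" from the LSH family (the shingle length and number of hashes here are illustrative, not prescriptive):

```python
import hashlib

def minhash_signature(text, num_hashes=4, shingle=3):
    """Compute a handful of very lossy hashes: for each seed, take the
    minimum hash value over the string's character shingles. Similar
    strings share shingles, so they tend to agree on some minima."""
    shingles = {text[i:i + shingle] for i in range(len(text) - shingle + 1)}
    sigs = []
    for seed in range(num_hashes):
        sigs.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode("utf-8")).hexdigest(), 16)
            for s in shingles
        ))
    return sigs

def similarity_matches(sig_a, sig_b):
    """Count agreeing hashes; more matches means more similar, and a
    cutoff threshold can be chosen per application."""
    return sum(a == b for a, b in zip(sig_a, sig_b))
```

Each of the n hash values can be used as a key in its own index, so a lookup touches only strings that agree on at least one hash.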

Upvotes: 3

Mr Fooz

Reputation: 111946

If each set has fewer than 500 strings, perhaps you can directly compare a new string to each one. First convert to a standard representation (say, UTF-16). The Levenshtein distance can be a nice way of comparing the similarity of two strings.
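For reference, a textbook dynamic-programming implementation of Levenshtein distance (two-row version, so memory stays O(len(b))):

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions required to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]
```

With 500 comparisons per lookup and short natural-language strings, this brute-force approach is usually fast enough.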

Upvotes: 1
