Mickey Shine

Reputation: 12549

Detect duplicated/similar text among large datasets?

I have a large database with thousands of records. Every time a user posts his information, I need to check whether the same or a similar record already exists. Are there any algorithms or open-source implementations that solve this problem?

We're using Chinese, and by 'similar' I mean the records have mostly identical content, perhaps 80%-100% the same. Each record is not very big, about 2k-6k bytes.

Upvotes: 3

Views: 3007

Answers (4)

Amit Prakash

Reputation: 1221

Look at shingle/min-hash techniques. Here is a presentation that could help you.
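A minimal sketch of the shingle + MinHash idea in Python (the character k-gram size, the number of hash functions, and the MD5-based hash family are illustrative assumptions, not part of any particular library):

    import hashlib

    def shingles(text, k=4):
        # Character k-grams; character shingles suit Chinese text,
        # which has no whitespace word boundaries.
        return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

    def minhash_signature(shingle_set, num_hashes=128):
        # For each salted hash function, keep the minimum hash value
        # over all shingles; the signature is a compact fingerprint
        # of the record.
        signature = []
        for seed in range(num_hashes):
            signature.append(min(
                int.from_bytes(
                    hashlib.md5(("%d:%s" % (seed, s)).encode("utf-8")).digest()[:8],
                    "big")
                for s in shingle_set))
        return signature

    def estimated_jaccard(sig_a, sig_b):
        # The fraction of matching signature slots estimates the
        # Jaccard similarity of the two shingle sets.
        return sum(a == b for a, b in zip(sig_a, sig_b)) / float(len(sig_a))

Store one signature per record; at insert time, compare the new record's signature against the stored ones (or bucket signatures with locality-sensitive hashing to avoid a full pairwise scan).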

Upvotes: 1

John

Reputation: 120

One approach I have used for something similar is to construct a search index in the usual way, based on word statistics, and then run the new item as a query against that index; if the score for the top hit is too high, the new item is too similar. No doubt some of the standard text-search libraries could be used for this, although with only a few thousand records it is fairly trivial to build your own.
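For example, a minimal sketch of this idea using scikit-learn (the library choice, the character n-gram range, and the 0.8 cutoff are assumptions; any TF-IDF/search library would do, and character n-grams are used because Chinese text is not whitespace-delimited):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    records = ["...existing record one...", "...existing record two..."]  # placeholder corpus

    vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 3))
    index = vectorizer.fit_transform(records)

    def is_near_duplicate(new_text, threshold=0.8):
        # Score the new record against the index as if it were a search
        # query; a top score above the threshold means "too similar".
        query = vectorizer.transform([new_text])
        return cosine_similarity(query, index).max() >= threshold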

Upvotes: 0

OmnipotentEntity

Reputation: 17131

This answer has a very high complexity class (worst case it's quintic; expected case it's quartic to verify your database the first time, then quartic/cubic to add a record), so it doesn't scale well. Unfortunately, I can't think of a much better answer right now.

The algorithm is called the Ratcliff-Obershelp algorithm, and it's implemented in Python's difflib. The algorithm itself is cubic time in the worst case and quadratic in the expected case. You then have to run it for each possible pair of records, which is quadratic in the number of records. When adding a single record, of course, the number of comparisons is only linear.

EDIT: Sorry, I misread the documentation; difflib is only quadratic, not cubic. Use it rather than the other algorithm.
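For reference, a minimal sketch of the difflib approach described above (the 0.8 threshold is an assumption matching the asker's "80%-100% the same"):

    import difflib

    def similarity(a, b):
        # Ratcliff-Obershelp similarity in [0, 1].
        return difflib.SequenceMatcher(None, a, b).ratio()

    def find_similar(new_record, records, threshold=0.8):
        # Pairwise comparison of the new record against every stored
        # record -- the linear-per-insert cost mentioned above.
        return [r for r in records if similarity(new_record, r) >= threshold]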

Upvotes: 1
