dpavlin

Reputation: 1430

How to detect duplicate text with some fuzziness

Some time ago, I wrote a small script using Text::DeDupe to remove duplicate blog posts before I had to lay my eyes on them.

After reading the Syntactic Clustering of the Web paper, on which that implementation is based, I would love to be able to find overlapping documents as well (e.g. snippets of blog posts as opposed to the full text, and maybe also quotes).

Do you know of any other implementation in C, C++ or Perl that I can try out before writing my own?
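
To make "overlapping" concrete, here is a minimal Python sketch of the two measures the paper defines over word shingles: resemblance (how alike two documents are) and containment (how much of one document appears inside another). It is only an illustration, not how Text::DeDupe works internally. The function names, the shingle width `w=4`, and the sample strings are my own assumptions, and it compares exact shingle sets rather than the hashed, sampled sketches the paper uses to scale to large collections.

```python
def shingles(text, w=4):
    """Return the set of w-word shingles (contiguous word windows) in text."""
    words = text.lower().split()
    if len(words) < w:
        # Texts shorter than one window yield a single short shingle (or none).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=4):
    """Shared shingles divided by all shingles: close to 1.0 for near-duplicates."""
    sa, sb = shingles(a, w), shingles(b, w)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def containment(a, b, w=4):
    """Fraction of a's shingles found in b: close to 1.0 when a is a snippet of b."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa) if sa else 0.0


# Hypothetical sample data: a full post and a snippet quoted from it.
full_post = "the quick brown fox jumps over the lazy dog near the river bank today"
snippet = "fox jumps over the lazy dog"

print(resemblance(snippet, full_post))   # low: the documents differ a lot in size
print(containment(snippet, full_post))   # high: the snippet lies inside the post
```

Resemblance alone misses the snippet case because the full post contributes many shingles the snippet lacks, which is why containment is the measure I am after for snippets and quotes.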

Upvotes: 3

Views: 2171

Answers (1)

dpavlin

Reputation: 1430

SpotSigs seems to fit the bill just right. Here are some references:

The source code for this module is hosted on GitHub:

http://github.com/jzawodn/perl-text-spotsig

Upvotes: 2
