coder
coder

Reputation: 1941

k-means versus LSH algorithm

I'm pretty new to data mining and ML. I want to understand how different is k-means from LSH. Upon reading few papers and other materials available online, it seems that both algorithms try to achieve grouping / clustering of similar documents. For usecases like spam detection, either of them have been used in many papers. But I'm not very clear how they are different and if at all we use this for a usecase like spam detection, how would the result differ at all?

Upvotes: 1

Views: 3160

Answers (1)

Has QUIT--Anony-Mousse
Has QUIT--Anony-Mousse

Reputation: 77505

LSH doesn't cluster your data.

It is suitable for near-duplicate (!) detection.

  1. LSH by design may produce "false positives" (hash collissions) that are not similar at all.
  2. LSH has a threshold t, and it only tries to produce hash collissions for objects below this threshold. For good performance, you need to choose this threshold as small as possible. For clustering, you do need to be able to find objects outside of your bucket (farther away than t) - you can't do this reliably with LSH.
  3. LSH will place bucket boundaries randomly; the only reason why you don't notice this that much is that you do this multiple times, and hope that not all of them are badly chosen. So you only get almost all close neighbors. Maybe even only 90%, depending on your parameters. As every object is in multiple buckets, what would be its cluster? You get a huge amount of overlapping 'clusters' that each only contain some parts of your data. It's all but clear how to efficiently find good clusters from this.

LSH is really about "almost the same" objects, not about finding larger structure in your data.

I don't think spam detection is a good use case for either - do you know of any spam filter that would actually do this? The near-duplicate news detection of e.g. Google News is however related to some kind of LSH; supposedly they are using minhashing.

Upvotes: 4

Related Questions