Reputation: 1941
I'm pretty new to data mining and ML. I want to understand how k-means differs from LSH. After reading a few papers and other materials available online, it seems that both algorithms try to group / cluster similar documents. For use cases like spam detection, both have been used in many papers. But I'm not clear on how they differ, and if we used them for a use case like spam detection, how would the results differ?
Upvotes: 1
Views: 3160
Reputation: 77505
LSH doesn't cluster your data.
It is suitable for near-duplicate (!) detection.
LSH is really about "almost the same" objects, not about finding larger structure in your data.
I don't think spam detection is a good use case for either - do you know of any spam filter that would actually do this? Near-duplicate news detection, e.g. in Google News, is however related to some kind of LSH; supposedly they use minhashing.
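To make the "almost the same" point concrete, here is a minimal sketch of MinHash-based LSH in plain Python. The function names (`shingles`, `minhash_signature`, `candidate_pairs`), the word 3-gram shingling, and the 32 hashes / 8 bands settings are my own illustrative assumptions, not taken from any library and not a claim about how Google News does it.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of word k-grams of a document (assumed tokenisation: whitespace)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def _hash(seed, token):
    """Seeded hash: SHA-1 of 'seed:token', interpreted as an integer."""
    return int.from_bytes(hashlib.sha1(f"{seed}:{token}".encode()).digest(), "big")


def minhash_signature(shingle_set, num_hashes=32):
    """For each seeded hash function keep the minimum value over the set."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(docs, num_hashes=32, bands=8):
    """Banding: documents that agree on every row of at least one band share a bucket."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        sig = minhash_signature(shingles(text), num_hashes)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        for a in ids:
            for c in ids:
                if a < c:
                    pairs.add((a, c))
    return pairs


if __name__ == "__main__":
    docs = {
        "a": "cheap pills buy now limited offer today only",
        "b": "cheap pills buy now limited offer today only",
        "c": "the quarterly report covers revenue growth across regions",
    }
    print(candidate_pairs(docs))
```

Note what comes back: a set of candidate near-duplicate pairs. A document with no near-duplicate, like `"c"` here, appears in no pair at all. k-means, by contrast, would assign every document to one of k clusters regardless of how close it is to anything else, which is why the two are not interchangeable.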
Upvotes: 4