deividaspetraitis

Reputation: 379

Cosine similarity on large data sets

Currently I'm studying data mining and text comparison, and I have found this: https://en.wikipedia.org/wiki/Cosine_similarity.

Since I have successfully implemented this algorithm to compare two strings, I decided to try a more complex task. I iterated over my DB, which contains about 250k documents, and compared one random document against every document in that DB.
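For illustration, the brute-force comparison described above might look like the following Python sketch. It is not the original implementation: the whitespace tokenization, the raw term-count weighting, the function names (`tf_vector`, `cosine_similarity`, `rank_against_all`) and the in-memory `docs` list standing in for the DB are all assumptions.

```python
import math
from collections import Counter


def tf_vector(text):
    # Assumption: lowercase + whitespace split is a good enough tokenizer.
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    # Iterate over the smaller vector; only shared terms contribute to the dot product.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def rank_against_all(query, docs):
    # Brute force: one full similarity computation per stored document.
    q = tf_vector(query)
    scores = [(i, cosine_similarity(q, tf_vector(d))) for i, d in enumerate(docs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Every query touches every document, including those that share no term with it, so the cost grows linearly with the size of the collection.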

Comparing all of these items took 316.35898590088 sec, that is, more than 5 minutes to compare against all 250k documents!

These results raise several issues, and I want to ask for some suggestions. For clarity, I'll first describe some details that might be useful.

Questions

Upvotes: 0

Views: 544

Answers (1)

Has QUIT--Anony-Mousse

Reputation: 77474

Both PHP and MySQL are about the worst choices you could have made.

Efficient cosine similarity is at the heart of Lucene. The key acceleration technique is compressed inverted indexes. But you really don't want to reimplement them in PHP...
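To show the idea only, here is a toy inverted index in Python. It is an uncompressed, in-memory sketch and not how Lucene is actually implemented or how it scores documents. The names `build_index` and `search`, the whitespace tokenization and the raw term counts are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    # postings: term -> list of (doc_id, term count); norms: vector length per document.
    postings = defaultdict(list)
    norms = []
    for doc_id, text in enumerate(docs):
        tf = Counter(text.lower().split())
        norms.append(math.sqrt(sum(c * c for c in tf.values())))
        for term, count in tf.items():
            postings[term].append((doc_id, count))
    return postings, norms


def search(query, postings, norms, top_k=10):
    q = Counter(query.lower().split())
    q_norm = math.sqrt(sum(c * c for c in q.values()))
    dots = defaultdict(float)
    # Only documents that share at least one term with the query are ever visited.
    for term, q_count in q.items():
        for doc_id, count in postings.get(term, ()):
            dots[doc_id] += q_count * count
    scored = [(doc_id, dot / (q_norm * norms[doc_id])) for doc_id, dot in dots.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

The index and the document norms are built once. Each query then walks only the posting lists of its own terms, so documents with no term in common are skipped instead of being compared one by one.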

Upvotes: 1
