Reputation: 379
I'm currently studying data mining and text comparison, and I found this: https://en.wikipedia.org/wiki/Cosine_similarity.
Since I successfully implemented this algorithm to compare two strings, I decided to try a more complex task. I iterated over my DB, which contains about 250k documents, and compared one random document from the DB against every document in that DB.
Comparing all these items took 316.35898590088 sec, that is, more than 5 minutes to compare against all 250k documents!
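For reference, the brute-force comparison described above can be sketched roughly as follows. This is only an illustration in Python, not the original implementation (which is not shown here): the whitespace tokenization, the raw term counts, and the names `term_vector`, `cosine_similarity` and `rank_against_all` are all assumptions.

```python
import math
from collections import Counter

def term_vector(text):
    # Assumption: lowercase whitespace tokenization with raw term counts.
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    # Dot product over the terms the two vectors share.
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_against_all(query_text, documents):
    # Brute force: one similarity computation per stored document,
    # so the cost grows linearly with the size of the collection.
    query = term_vector(query_text)
    scores = [(doc_id, cosine_similarity(query, term_vector(text)))
              for doc_id, text in documents.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Here `documents` is a hypothetical dict mapping a document id to its text. Every document is tokenized and scored on every query, even when it shares no terms with the query.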
These results raised several issues, and I'd like to ask for some suggestions. For clarity, I'll first describe some details which might be useful.
Questions
Upvotes: 0
Views: 544
Reputation: 77474
Both PHP and MySQL are about the worst choices you could have made.
Efficient cosine similarity is at the heart of Lucene. The key acceleration technique is compressed inverted indexes. But you really don't want to reimplement them in PHP...
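To give a rough idea of why an inverted index helps, here is a toy Python sketch. It is uncompressed and in-memory, it is not how Lucene actually implements this, and the names `build_index` and `search` plus the whitespace tokenization are assumptions made for illustration. The point is that a query only visits the posting lists of its own terms, instead of scoring every document.

```python
import math
from collections import Counter, defaultdict

def build_index(documents):
    # postings: term -> list of (doc_id, term count)
    # norms:    doc_id -> Euclidean length of the document's term vector
    postings = defaultdict(list)
    norms = {}
    for doc_id, text in documents.items():
        counts = Counter(text.lower().split())
        norms[doc_id] = math.sqrt(sum(c * c for c in counts.values()))
        for term, count in counts.items():
            postings[term].append((doc_id, count))
    return postings, norms

def search(query_text, postings, norms, top_k=10):
    query = Counter(query_text.lower().split())
    query_norm = math.sqrt(sum(c * c for c in query.values()))
    dots = defaultdict(float)
    # Only documents sharing at least one term with the query are touched.
    for term, q_count in query.items():
        for doc_id, d_count in postings.get(term, ()):
            dots[doc_id] += q_count * d_count
    scored = [(doc_id, dot / (query_norm * norms[doc_id]))
              for doc_id, dot in dots.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]
```

The index is built once, and the document norms are precomputed, so each query costs roughly the total length of the posting lists for its terms rather than a pass over the whole collection.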
Upvotes: 1