Reputation: 71
I am basically creating a search engine and I want to implement tf*idf to rank my xml documents based on a search query. How do I implement it? How do I start it? Any help appreciated.
Upvotes: 3
Views: 13964
Reputation: 25206
Apache Mahout:
I believe it requires a Hadoop File System, which is a bit of extra work. But it works great.
Upvotes: 1
Reputation: 668
Surprising that the Weka library hasn't been mentioned here. Weka's StringToWordVector class implements TF-IDF.
Upvotes: 2
Reputation: 76171
I did this in the past, and I used Lucene to get the TD*IDF data.
It took fair amount of fiddling aound though, so if there are other solutions people know are easier, then use them.
Start by looking at TermFreqVector and other classes in org.apache.lucene.index.
Upvotes: 1