Reputation: 71

java - tf*idf implementation?

I am building a search engine and want to use tf*idf to rank my XML documents against a search query. How do I implement it, and where should I start? Any help is appreciated.

Upvotes: 3

Answers (4)

Sridhar Sarnobat

Reputation: 25314

Apache Mahout:

https://github.com/apache/mahout/blob/master/mr/src/main/java/org/apache/mahout/vectorizer/TFIDF.java

I believe it requires a Hadoop File System, which is a bit of extra work. But it works great.

Upvotes: 1

shark8me

Reputation: 668

It's surprising that the Weka library hasn't been mentioned here. Weka's StringToWordVector class implements TF-IDF.

Upvotes: 2

W.P. McNeill

Reputation: 17086

tfidf is a standalone Java package that calculates Tf-Idf.
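For orientation, the calculation itself is small. Below is a minimal sketch in plain Java, not the code of that package or of any library mentioned in this thread. It assumes one common weighting (tf as the term count divided by document length, idf as ln(N / df)); other variants exist. The class name TfIdfSketch and its methods are hypothetical, and extracting the text from the XML documents is not shown.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class for illustration only; not part of any library.
final class TfIdfSketch {

    // Lower-case the text and split it on runs of non-alphanumeric characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Term frequency: occurrences of the term divided by the document length.
    static double tf(List<String> doc, String term) {
        if (doc.isEmpty()) {
            return 0.0;
        }
        long count = doc.stream().filter(term::equals).count();
        return (double) count / doc.size();
    }

    // Inverse document frequency: ln(N / df), or 0 if no document has the term.
    static double idf(List<List<String>> docs, String term) {
        long df = docs.stream().filter(d -> d.contains(term)).count();
        return df == 0 ? 0.0 : Math.log((double) docs.size() / df);
    }

    // Score one document against a query: sum of tf * idf over distinct query terms.
    static double score(List<List<String>> docs, List<String> doc, List<String> query) {
        double sum = 0.0;
        for (String term : new HashSet<>(query)) {
            sum += tf(doc, term) * idf(docs, term);
        }
        return sum;
    }

    public static void main(String[] args) {
        List<List<String>> docs = new ArrayList<>();
        docs.add(tokenize("the quick brown fox"));
        docs.add(tokenize("the lazy dog"));
        docs.add(tokenize("the fox jumps over the dog"));

        List<String> query = tokenize("quick fox");
        for (int i = 0; i < docs.size(); i++) {
            System.out.printf("doc %d: %.4f%n", i, score(docs, docs.get(i), query));
        }
    }
}
```

To rank results, score every document against the query and sort by descending score. This sketch rescans the corpus for each term, which is fine for a handful of documents; a real search engine would precompute document frequencies in an inverted index.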

Upvotes: 1

daveb

Reputation: 76301

I did this in the past, and I used Lucene to get the TF*IDF data.

It took a fair amount of fiddling around though, so if people know of other solutions that are easier, use them.

Start by looking at TermFreqVector and other classes in org.apache.lucene.index.

Upvotes: 1
