BigG

Reputation: 1011

Create a dataset: extract features from text documents (TF-IDF)

I have to create a dataset from some text files, writing each document as a vector of features.

Something like this:

doc1: 1,0.45 6,0.001 94,0.1 ...

doc2: 3,0.5 98,0.2 ...

...

Each position of the vector represents a word, and the score is given by something like TF-IDF.

Do you know of a library or tool for this? (Java is preferred.)

Upvotes: 1

Views: 2785

Answers (3)

BigG

Reputation: 1011

After a few days I found the "perfect tool" for this: Word Vector Tool. http://sourceforge.net/projects/wvtool/

Upvotes: 2

Darknight

Reputation: 2500

Sure, there are many, e.g. Lucene: http://en.wikipedia.org/wiki/Lucene

However, I recommend that you write a basic IR system from scratch. Looking under the hood is always a great learning experience.
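To give an idea of what "from scratch" could look like, here is a minimal sketch in plain Java that turns a few documents into sparse TF-IDF vectors in roughly the format the question asks for. Everything in it is an assumption made for illustration: the class and method names (TfIdfSketch, tokenize, vectorize, format) are hypothetical, the tokenizer is a naive lowercase split, term indices are 0-based in order of first appearance, and the weight is term frequency divided by document length, times log(N / document frequency). It is not how Lucene or any other library does it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class TfIdfSketch {

    // Naive tokenizer (assumption): lowercase, split on non-letter/non-digit runs.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Returns one sparse vector per document: term index -> TF-IDF weight.
    // vocab is filled with term -> index (0-based, in order of first appearance).
    static List<Map<Integer, Double>> vectorize(List<String> docs, Map<String, Integer> vocab) {
        List<Map<Integer, Integer>> counts = new ArrayList<>();
        List<Integer> lengths = new ArrayList<>();
        Map<Integer, Integer> docFreq = new HashMap<>();

        // First pass: raw term counts per document and document frequencies.
        for (String doc : docs) {
            List<String> tokens = tokenize(doc);
            Map<Integer, Integer> tf = new TreeMap<>();
            for (String tok : tokens) {
                Integer idx = vocab.get(tok);
                if (idx == null) {
                    idx = vocab.size();
                    vocab.put(tok, idx);
                }
                tf.merge(idx, 1, Integer::sum);
            }
            for (Integer idx : tf.keySet()) {
                docFreq.merge(idx, 1, Integer::sum);
            }
            counts.add(tf);
            lengths.add(tokens.size());
        }

        // Second pass: weight = (count / docLength) * ln(N / docFreq).
        List<Map<Integer, Double>> vectors = new ArrayList<>();
        for (int d = 0; d < docs.size(); d++) {
            Map<Integer, Double> vec = new TreeMap<>();
            for (Map.Entry<Integer, Integer> e : counts.get(d).entrySet()) {
                double tf = (double) e.getValue() / lengths.get(d);
                double idf = Math.log((double) docs.size() / docFreq.get(e.getKey()));
                double weight = tf * idf;
                // Terms that occur in every document get weight 0; keep the vector sparse.
                if (weight > 0) vec.put(e.getKey(), weight);
            }
            vectors.add(vec);
        }
        return vectors;
    }

    // Formats a vector as "name: index,weight index,weight ...".
    static String format(String name, Map<Integer, Double> vec) {
        StringBuilder sb = new StringBuilder(name).append(":");
        for (Map.Entry<Integer, Double> e : vec.entrySet()) {
            // Locale.ROOT keeps '.' as the decimal separator so it cannot clash with the ','.
            sb.append(String.format(Locale.ROOT, " %d,%.4f", e.getKey(), e.getValue()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("the cat sat", "the dog barked", "a cat and a dog");
        Map<String, Integer> vocab = new LinkedHashMap<>();
        List<Map<Integer, Double>> vectors = vectorize(docs, vocab);
        for (int d = 0; d < vectors.size(); d++) {
            System.out.println(format("doc" + (d + 1), vectors.get(d)));
        }
    }
}
```

A real system would also want stop-word removal, stemming, and probably a smoothed IDF, which is where a library starts to pay off.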

Upvotes: 0

Yin Zhu

Reputation: 17119

Try MALLET. It includes TF-IDF, POS tagging, and classification.

Upvotes: 0
