London guy

Reputation: 28012

How to create a bag of words using Weka?

I have a corpus of documents and I want to represent each document as a vector. Basically, the vector would have 1 for words that are present inside a document and for other words (which are present in other documents in the corpus and not in this particular document) it would have a 0. How do I create this vector for all the documents in Weka?

Is there a quick way to do this using Weka? I also want Weka to remove stopwords and do some pre-processing, if possible, before it creates this vector.

Thanks, Abhishek S

Upvotes: 5

Views: 5746

Answers (1)

michaeltwofish

Reputation: 4086

You want the StringToWordVector filter.

It has options for binary occurrence (presence or absence of a word rather than its count) and stopword removal, among many others, such as stemming, truncating the word list, discarding infrequent terms, and case folding.
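
For illustration, here is a minimal sketch of how the filter might be driven from Java. It assumes Weka 3.8 on the classpath (older 3.6 releases configure stopwords differently), and the class name `BagOfWordsSketch`, the helper names, and the two sample sentences are made up for the example. The Rainbow stopword list is just one of the handlers Weka ships; swap in whichever suits your corpus.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.core.stopwords.Rainbow;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

// Hypothetical example class; assumes Weka 3.8.
public class BagOfWordsSketch {

    // Wrap raw document strings in a dataset with one string attribute.
    static Instances toDataset(List<String> docs) {
        ArrayList<Attribute> attrs = new ArrayList<>();
        // A null value list makes this a string (not nominal) attribute.
        attrs.add(new Attribute("text", (List<String>) null));
        Instances data = new Instances("corpus", attrs, docs.size());
        for (String doc : docs) {
            double[] vals = new double[1];
            vals[0] = data.attribute(0).addStringValue(doc);
            data.add(new DenseInstance(1.0, vals));
        }
        return data;
    }

    // Convert each document into a 0/1 word-presence vector.
    static Instances binaryBagOfWords(Instances data) throws Exception {
        StringToWordVector filter = new StringToWordVector();
        filter.setOutputWordCounts(false);          // 0/1 presence instead of counts
        filter.setLowerCaseTokens(true);            // case folding
        filter.setStopwordsHandler(new Rainbow());  // drop stopwords (assumed choice of list)
        filter.setWordsToKeep(1000);                // rough cap on vocabulary size
        filter.setInputFormat(data);                // must be called after setting options
        return Filter.useFilter(data, filter);
    }

    public static void main(String[] args) throws Exception {
        Instances data = toDataset(Arrays.asList(
                "The cat sat on the mat",
                "The dog chased the cat"));
        // Prints the result in ARFF form; each remaining word is an attribute.
        System.out.println(binaryBagOfWords(data));
    }
}
```

The output uses sparse instances, so only the words actually present in a document are listed for it; absent words are implicitly 0. For a corpus stored as text files on disk, `weka.core.converters.TextDirectoryLoader` can build the input dataset instead of the hand-rolled `toDataset` helper. The same filter is also available in the Explorer under Preprocess, in the unsupervised attribute filters.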

Upvotes: 8
