ubuntunoob
ubuntunoob

Reputation: 249

Text classification using Weka

I'm a beginner to Weka and I'm trying to use it for text classification. I have seen how to StringToWordVector filter for classification. My question is, is there any way to add more features to the text I'm classifying? For example, if I wanted to add POS tags and named entity tags to the text, how would I use these features in a classifier?

Upvotes: -1

Views: 577

Answers (1)

Jose Maria Gomez Hidalgo
Jose Maria Gomez Hidalgo

Reputation: 1061

It depends of the format of your dataset and the preprocessing steps you perform. For instance, let us suppose that you have pre-POS-tagged your texts, looking like:

The_det dog_n barks_v ._p

So you can build an specific tokenizer (see weka.core.tokenizers) to generate two tokens per word, one would be "The" and the other one would be "The_det" so you keep the tag information.

If you want only tagged words, then you can just ensure that "_" is not a delimiter in the weka.core.tokenizers.WordTokenizer.

My advice is to have both the words and tagged words, so a simpler way would be to write an script that joins the texts and the tagged texts. From a file containing "The dog barks" and another one cointaining "The_det dog_n barks_v ._p", it would generate a file with "The The_det dog dog_n barks barks_v . ._p". You may even forget about the order unless you are going to make use of n-grams.

Upvotes: 2

Related Questions