Reputation: 573
I'd like to do text classification (Naive Bayes) with Weka using the Simple CLI (command line), but I have one problem: Weka's classifiers can't handle string attributes, so the strings have to be converted first. How can I convert the strings in my ARFF file from the command line?
sentences.arff example
@relation data set
@attribute text string
@attribute class {swedish,'?',english}
@data
'detta är en svensk text',swedish
'this is an english text',english
'what is the name of this book?',english
'vilken färg är en liten stuga?',swedish
'you are the best',english
'en enstaka fjäder i hatten fördröjer livet ett tag',swedish
'detta är en annan svensk text',swedish
I'm using the following command to create a model:
java weka.classifiers.bayes.NaiveBayes -t data.arff -d data.model
Upvotes: 1
Views: 1143
Reputation: 14701
Use the StringToWordVector filter to convert string attributes into numeric word attributes. Most classifiers in Weka cannot work with string values; see "Working with textual data". After that, you can use NaiveBayes as usual.
java weka.filters.unsupervised.attribute.StringToWordVector -i datasets\sentences.arff > datasets\sentencesWordVector.arff
java weka.classifiers.bayes.NaiveBayes -t datasets\sentencesWordVector.arff -c 1 -x 3
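Note that I had to use 3 folds (-x 3) because your example has only 7 instances, which is fewer than the default of 10 folds. I also set the class index to 1 (-c 1), because StringToWordVector places the class attribute first in its output, ahead of the generated word attributes.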
On my computer, I get the following confusion matrix. The empty row and column for b are expected, since your example does not contain any instance labeled '?'.
=== Confusion Matrix ===
a b c <-- classified as
4 0 0 | a = swedish
0 0 0 | b = ?
0 0 3 | c = english
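If you also want to save a model with -d, as in the question, one option is to wrap the filter and the classifier together in a FilteredClassifier, so that the same word dictionary is applied again when the model is used later. The following is only a sketch: it assumes weka.jar is on the classpath, that the original (unfiltered) sentences.arff sits in a datasets directory, and the names sentences.model and unseen.arff are hypothetical. The class is the last attribute in the unfiltered file, which should be Weka's default, so no -c is given here.

```shell
# Train on the raw ARFF file; the StringToWordVector filter runs inside
# the classifier, so no separate conversion step is needed.
# -x 3 keeps the 3-fold cross-validation, -d saves the trained model.
java weka.classifiers.meta.FilteredClassifier \
  -t datasets/sentences.arff \
  -x 3 \
  -d datasets/sentences.model \
  -F weka.filters.unsupervised.attribute.StringToWordVector \
  -W weka.classifiers.bayes.NaiveBayes

# Load the saved model (-l) and print predictions (-p 0) for a second file.
# unseen.arff is a hypothetical file with the same header as sentences.arff.
java weka.classifiers.meta.FilteredClassifier \
  -l datasets/sentences.model \
  -T datasets/unseen.arff \
  -p 0
```

Forward slashes are used in the paths so the commands behave the same in a Unix shell; Java accepts them on Windows as well.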
Upvotes: 2