user2138515

Reputation: 43

Working with string data and classification in Weka

I have a dataset in which each instance pairs a string with the class it belongs to. The string is a sentence, and the class is either 'male' or 'female'. For example:

'Hi! My name is Jack', male

I am using this as a training set so that, given a different set of strings, a model can classify whether each statement came from a male or a female. I am using Weka's StringToWordVector filter to convert each string into a vector containing the counts of the words in that string. From the resulting ARFF file I want to build a classifier (decision trees?) that I can apply to an unclassified dataset. How do I go about it? Which classifier should I use? And which other preprocessing techniques would help in this scenario?

Upvotes: 4

Views: 6215

Answers (1)

tdc

Reputation: 8597

A good place to start would be the Simple Message Classifier example (code and wiki) on the Weka homepage, or the Text Categorization wiki page.

Pretty much any linear classifier is a reasonable first choice for word-count features. I'd suggest trying either Logistic Regression or Support Vector Machines first.
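
To make the idea concrete, here is a minimal, library-free sketch of what a linear classifier over word counts does. It is not Weka code: the class name GenderTextSketch, its methods, the hyperparameters, and every training sentence except the one from the question are made up for illustration. In Weka itself you would more likely wrap StringToWordVector and a classifier such as Logistic or SMO in a FilteredClassifier instead of hand-rolling this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GenderTextSketch {

    /** Learned parameters: one weight per word plus a bias term. */
    static class Model {
        final Map<String, Double> weights = new HashMap<>();
        double bias = 0.0;
    }

    /** Lowercase the sentence and split on anything that is not a letter, digit, or apostrophe. */
    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        for (String t : sentence.toLowerCase().split("[^a-z0-9']+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Sparse bag-of-words vector: word -> number of occurrences in the sentence. */
    static Map<String, Integer> wordCounts(String sentence) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tokenize(sentence)) {
            counts.merge(t, 1, Integer::sum);
        }
        return counts;
    }

    /** Logistic model: sigmoid of (bias + sum of weight * count). */
    static double probabilityFemale(Model m, Map<String, Integer> counts) {
        double z = m.bias;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            z += m.weights.getOrDefault(e.getKey(), 0.0) * e.getValue();
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** Plain stochastic gradient descent on the log loss; "female" is the positive class. */
    static Model train(List<String> sentences, List<String> labels, int epochs, double learningRate) {
        Model m = new Model();
        for (int epoch = 0; epoch < epochs; epoch++) {
            for (int i = 0; i < sentences.size(); i++) {
                Map<String, Integer> counts = wordCounts(sentences.get(i));
                double target = labels.get(i).equals("female") ? 1.0 : 0.0;
                double error = target - probabilityFemale(m, counts);
                for (Map.Entry<String, Integer> e : counts.entrySet()) {
                    m.weights.merge(e.getKey(), learningRate * error * e.getValue(), Double::sum);
                }
                m.bias += learningRate * error;
            }
        }
        return m;
    }

    static String predict(Model m, String sentence) {
        return probabilityFemale(m, wordCounts(sentence)) >= 0.5 ? "female" : "male";
    }

    /** Toy data: only the first sentence comes from the question, the rest are invented. */
    static Model trainToyModel() {
        List<String> sentences = Arrays.asList(
                "Hi! My name is Jack",
                "Hello, my name is Jill",
                "My wife and I went hiking",
                "My husband and I went hiking");
        List<String> labels = Arrays.asList("male", "female", "male", "female");
        return train(sentences, labels, 500, 0.1); // epochs and learning rate are arbitrary choices
    }

    public static void main(String[] args) {
        Model m = trainToyModel();
        System.out.println(wordCounts("Hi! My name is Jack"));
        System.out.println(predict(m, "Hello, my name is Jill"));
    }
}
```

With a real dataset you would hold out part of the data (or cross-validate) to estimate accuracy, since four sentences only show the mechanics.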

Upvotes: 4
