zwang

Reputation: 705

Weka cannot use a string attribute to classify text

I have a classification task that takes a string as input and assigns it one of several labels. The training data looks like:

Text1: label_1
Text2: label_2
Text3: label_1

When I use Weka, many classifiers throw this exception:

weka.core.UnsupportedAttributeTypeException: weka.classifiers.functions.MultilayerPerceptron: Cannot handle string attributes!
    at weka.core.Capabilities.test(Capabilities.java:979)
    at weka.core.Capabilities.test(Capabilities.java:868)
    at weka.core.Capabilities.test(Capabilities.java:1084)
    at weka.core.Capabilities.test(Capabilities.java:1022)
    at weka.core.Capabilities.testWithFail(Capabilities.java:1301)

Upvotes: 0

Views: 3413

Answers (1)

amit

Reputation: 178411

It is hard to tell exactly what you are trying to achieve, but most machine learning classifiers expect numeric or binary attributes, not string attributes.

One thing you can do is convert your feature space to numeric/binary attributes using some model. The Bag of Words model is a common solution.

According to this model, what you have to do is:

  1. Iterate over ALL "features" (strings) in your database and assign a number (an attribute index) to each distinct string/word.
  2. For each classified example, create a new instance in the modified feature space: each word/string now has a number (from step 1), so set the attribute matching that number to the number of occurrences of the word in the text. The labels remain the same.
  3. Run the learning algorithm on the modified examples with the new (numeric) feature space.
  4. During classification, if you encounter a word that is not recognized (it did not appear in the training data, so it has no attribute number assigned), you can either silently ignore it or use some heuristic to predict whether it is related to a word you did see. For starters, I'd just ignore it and come back to this step later as an optimization.
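
The steps above can be sketched roughly as follows. This is only an illustration in plain Python, not Weka code: the function names `build_vocabulary` and `vectorize`, the sample texts, and the choice to tokenize by lowercasing and splitting on whitespace are all assumptions made for the example.

```python
from collections import Counter


def build_vocabulary(texts):
    """Step 1: assign an index to every distinct word in the training texts."""
    vocab = {}
    for text in texts:
        # Assumption: tokens are lowercase, whitespace-separated words.
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def vectorize(text, vocab):
    """Steps 2 and 4: count occurrences of known words, ignore unseen ones."""
    vector = [0] * len(vocab)
    for word, count in Counter(text.lower().split()).items():
        index = vocab.get(word)
        if index is not None:  # unseen words are silently skipped
            vector[index] = count
    return vector


# Hypothetical training data in the (text, label) shape from the question.
training = [
    ("the cat sat", "label_1"),
    ("the dog barked", "label_2"),
    ("the cat ran", "label_1"),
]

vocab = build_vocabulary(text for text, _ in training)
features = [vectorize(text, vocab) for text, _ in training]
labels = [label for _, label in training]

# Step 3 would pass `features` and `labels` to a learning algorithm.
# At classification time, a new text is vectorized with the same vocabulary:
unseen = vectorize("the the bird", vocab)
```

Here `unseen` only counts "the", because "bird" never appeared in training and is dropped. In Weka itself, the `StringToWordVector` filter is the usual way to get a comparable word-count representation from a string attribute, so hand-rolling this should not be necessary there.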

Upvotes: 4
