Drocchio

Reputation: 413

How can I solve a classification problem with a dependent variable with more than two values

I have a simple NLP problem: a set of written reviews, each with a binary positive or negative judgement. In this case I can train and test using the columns of X as independent variables. These columns hold the "bag of words", namely the individual words encoded as a count matrix.

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=300)
# independent variables
X = cv.fit_transform(corpus).toarray()
# dependent variable
y = dataset.iloc[:, 1].values

...and the dependent variable y is column 1, which takes the values 0 and 1 (so basically a positive or negative review).

If, instead of 0 and 1, I have reviews that can be rated from 1 to 5 stars, should I proceed with a y column holding values from 0 to 4? In other words, I would like to know how the model differs if, instead of a binary good/bad review, the user can give a rating from 1 to 5 after the review. What is this kind of problem called in machine learning?

Upvotes: 1

Views: 1633

Answers (3)

Venkatachalam

Reputation: 16966

This is called a multi-class classification problem, as mentioned by @rishi. There is a large variety of algorithms that can solve multi-class problems.

You can use the ratings column as your target variable.

# dependent variable (.loc selects by column label; .iloc only accepts integer positions)
y = dataset.loc[:, 'ratings'].values

Then, you can fit this data into the classifier!

from sklearn import linear_model
clf = linear_model.SGDClassifier()
clf.fit(X, y)
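
To make the steps above concrete, here is a minimal end-to-end sketch. The toy reviews and ratings are made up purely for illustration, and `random_state=0` is only there to make the run repeatable. It suggests that the ratings can stay as 1 to 5 without shifting them to 0 to 4, since scikit-learn classifiers treat the distinct values of y as class labels.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier

# Made-up toy reviews and star ratings, for illustration only
corpus = [
    "terrible product, broke immediately",
    "bad quality, not recommended",
    "average, nothing special",
    "good value, works well",
    "excellent, absolutely love it",
    "awful, waste of money",
    "poor build, disappointing",
    "okay for the price",
    "pretty good overall",
    "fantastic, best purchase ever",
]
ratings = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]

cv = CountVectorizer(max_features=300)
X = cv.fit_transform(corpus)  # the sparse matrix can be passed to fit directly
y = ratings                   # labels stay as 1..5, no remapping needed

clf = SGDClassifier(random_state=0)
clf.fit(X, y)

# The classifier records the distinct labels it saw during fit
print(clf.classes_)

# New text must go through the same fitted vectorizer before predict
predicted = clf.predict(cv.transform(["love it, excellent"]))
print(predicted)
```

With so little data the prediction itself is not meaningful; the point is only that the same fit/predict calls work unchanged with five classes.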

Upvotes: 1

rishi

Reputation: 2554

It is just a multi-class classification problem. Here is some sample code to give you an idea. What you are calling the 'dependent variable' is called the class (the class that the input example belongs to).

    import numpy as np
    from sklearn.svm import LinearSVC

    # labels holds the class of each example; works if your classes are strings, and there can be more than two
    unique = sorted(set(labels))  # distinct classes (assumed definition; not shown in the original snippet)
    label_idx = np.array([unique.index(l) for l in labels])  # map each class to an integer index
    vectors = np.array(vecs)  # vecs is any vectorised form of your text data
    clf = LinearSVC()  # classifier of your choice
    clf.fit(vectors, label_idx)
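
As a possible alternative to building the index mapping by hand, scikit-learn's `LabelEncoder` does the same job and can also map predictions back to the original class names. This is only a sketch; the label values below are made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up string classes, for illustration only
labels = ["bad", "good", "neutral", "good", "bad"]

encoder = LabelEncoder()
label_idx = encoder.fit_transform(labels)  # integer index per example

print(encoder.classes_)  # distinct classes in sorted order
print(label_idx)         # → [0 1 2 1 0]

# Turn integer predictions back into the original class names
decoded = encoder.inverse_transform([2, 0])
print(decoded)           # → ['neutral' 'bad']
```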

Upvotes: 2

ilearn

Reputation: 193

I have used the RandomForest classifier documented at the following link for multi-class problems; it is one of many possible ML algorithms you can use:

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier
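
A minimal sketch of that classifier on bag-of-words features might look like the following. The toy reviews, the choice of `n_estimators=50`, and `random_state=0` are assumptions made for illustration, not settings taken from the linked page.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Made-up toy reviews with star ratings, for illustration only
corpus = [
    "awful, waste of money",
    "terrible and broken",
    "okay for the price",
    "average, nothing special",
    "excellent, love it",
    "fantastic purchase",
]
ratings = [1, 1, 3, 3, 5, 5]

cv = CountVectorizer()
X = cv.fit_transform(corpus)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X, ratings)

# predict_proba returns one column per class, ordered as in forest.classes_
proba = forest.predict_proba(cv.transform(["love it"]))
print(forest.classes_, proba)
```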

However, in my personal experience, deep learning neural networks work better with text data, while tree-based models are better for tabular data with numeric values.

Upvotes: 1
