Reputation: 413
I have a simple NLP problem, where I have some written reviews that have a simple binary positive or negative judgement. In this case I am able to train and test as independent variables the columns of X that contain the "bags of words", namely the single words in a sparse matrix.
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 300)
#indipendent
X = cv.fit_transform(corpus).toarray()
#dependent
y = dataset.iloc[:, 1].values
..and the dependent variable y, that is represented by the column 1 that assume values as 0 and 1( so basically positive and negative review).
if instead of 0 and 1, I have reviews that can be voted from 1 to 5 stars should I proceed having an y variable column with values from 0 to 4?In other words I would lie to know how differ the model if instead of a binary good/bad review, the user has the possibility after his or her review to give a rating from 1 to 5. How is called this kind of problem in machine learning?
Upvotes: 1
Views: 1633
Reputation: 16966
This problem is called as multi-class classification problem as mentioned by @rishi. There is a large variety of algorithms that can solve the multi-class problem. Look here
You could make your target variable as one, which as the ratings.
#dependent
y = dataset.iloc[:, 'ratings'].values
Then, you can fit this data into the classifier!
from sklearn import linear_model
clf = linear_model.SGDClassifier()
clf.fit(X, y)
Upvotes: 1
Reputation: 2554
It is just multi-class classification problem. Here is a sample code from where you can get an idea. What you are calling 'dependent variable' is called class (class that the input example belongs to)
label_idx = [unique.index(l) for l in labels] """ labels= class. works for your class is string or so.
here labels can be more than two"""
label_idx = np.array(label_idx) # just get your class into array
vectors = np.array(vecs) # vecs are any vectorised form of your text data
clf = LinearSVC() # classifier of your choice
clf.fit(vectors, label_idx)
Upvotes: 2
Reputation: 193
I have used the following link for a RandomForest multiClassifier which is one of many possible ML algorithms you can use:
However, my personal experience shows deep learning neural networks work better with "text data" and tree-based models are better for tabular data with numeric values.
Upvotes: 1