Reputation: 53
I have a data frame that stores text reviews in column A and ratings (1 to 5) in column B.
id    review            rating
1     That was awful    1
I need to create a simple classifier (any algorithm will do), for example based on word-occurrence features from the vocabulary, that predicts whether the rating is above 3 (let's say we add another column with 1 if rating > 3 and 0 otherwise).
I'm not good at Python and machine learning, so I got stuck on all the samples I've googled.
Please explain how to extract features in this sample case, how to train a model, and so on, or point me to a good tutorial for this case (I'm not able to translate the sklearn tutorials to my case).
Upvotes: 3
Views: 3002
Reputation: 1468
You can do this extremely easily in scikit-learn.
Let's say you have X and y data:
X = ['the food was really delicious', 'the food was really terrible']
y = [5,2]
Using CountVectorizer, you can convert the data into numbers in two lines of code:
from sklearn.feature_extraction.text import CountVectorizer
x_data = CountVectorizer().fit_transform(X)
This converts your data into word counts, which you can then feed into whichever algorithm you want:
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier().fit(x_data, y)
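Note that the snippet above discards the fitted CountVectorizer, so it cannot transform new reviews later. A minimal sketch of the full pattern, keeping the vectorizer around (the new review text here is made up):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier

X = ['the food was really delicious', 'the food was really terrible']
y = [5, 2]

# keep a reference to the vectorizer so new text can be
# transformed with the vocabulary learned from X
vectorizer = CountVectorizer()
x_data = vectorizer.fit_transform(X)

# n_neighbors=1 because there are only two training samples
clf = KNeighborsClassifier(n_neighbors=1).fit(x_data, y)

# use transform (not fit_transform) on unseen reviews
new_review = vectorizer.transform(['really delicious food'])
print(clf.predict(new_review))  # predicts 5 here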
Upvotes: 5
Reputation: 38177
There are two general steps, each of which could be explained in great detail.
First, you need to determine which features to use. This is one of the main tasks, and it is up to you. The standard approach is a bag-of-words model, which counts the occurrence of each word in each text. It is quite simplistic but surprisingly useful in practice.
There are also specialized tools that do a tf-idf analysis for you, for example Sally.
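Scikit-learn ships a tf-idf vectorizer as well; a minimal sketch of the same idea (the two example sentences are made up):
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['that was awful', 'that was great']

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# each row is a document, each column a term weighted by tf-idf
print(vectorizer.get_feature_names_out())
print(X.toarray())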
Let's assume you want to do this in Python using scikit-learn. The data is already available as a class Review(object) with text and rating attributes. From the text, you need to extract features.
Example:
def extract(review):
    '''extracts word-count features from a review'''
    result = {}
    for word in review.text.split():
        if word in result:
            result[word] += 1
        else:
            result[word] = 1
    return result
would give you a count of all words in the text (there is also the library class collections.Counter, which can do that for you; a sketch follows after the code below). These you can combine to form a feature matrix X. (This code could be heavily optimized.)
X = []
y = []
words = []
# build an index of all occurring words
for review in reviews:
    for word in extract(review):
        if word not in words:
            words.append(word)
# create the feature vectors for classification
for review in reviews:
    feature_vector = [0] * len(words)
    y.append(review.rating)
    for word, count in extract(review).items():
        feature_vector[words.index(word)] = count
    X.append(feature_vector)
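For completeness, the Counter-based variant of extract mentioned above could look like this (a minimal sketch; Counter is a dict subclass, so the .items() call in the loop above still works):
from collections import Counter

def extract(review):
    '''extracts word-count features using collections.Counter'''
    return Counter(review.text.split())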
Now that you have the feature vectors, you need to decide which classifier to use. Among the easiest is k-nearest-neighbors.
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
knn.predict(X_test)
Compare the result to y_test.
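One way to do that comparison is scikit-learn's accuracy_score (a minimal sketch):
from sklearn.metrics import accuracy_score

# fraction of test reviews whose rating was predicted exactly
print(accuracy_score(y_test, knn.predict(X_test)))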
Let's consider an example of two reviews:
- that was awful | rating 1;
- that was great | rating 5.
Two dicts are created: {'that': 1, 'was': 1, 'awful': 1} and {'that': 1, 'was': 1, 'great': 1}. What should the X and y vectors look like in that case?
Firstly, your words list might be ['that', 'was', 'awful', 'great'].
Then you might get:
X = [[1, 1, 1, 0],
     [1, 1, 0, 1]]
y = [1, 5]
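And since the question asks for a binary target (1 if rating > 3, else 0), you can derive it from y in one line (a sketch):
# 1 if the reviewer liked it (rating > 3), 0 otherwise
y_binary = [1 if rating > 3 else 0 for rating in y]
# [1, 5] becomes [0, 1]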
Upvotes: 2