Sergei Hudozhnin

Reputation: 53

Having text data stored in a pandas DataFrame, how to implement simple classification with sklearn

I have a DataFrame that stores text reviews in one column and ratings (1 to 5) in another:

id    review            rating
1     That was awful    1

I need to create a simple classifier (any algorithm will do), based on features like a word-occurrence vocabulary, that predicts whether the rating is > 3 or < 3 (let's say we add another column with 1 if the rating is > 3 and 0 otherwise).
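Something like this sketch of the extra column (column names are just what I imagine):

import pandas as pd

# toy frame with the same layout (values made up)
df = pd.DataFrame({'id': [1], 'review': ['That was awful'], 'rating': [1]})

# label is 1 when the rating is above 3, otherwise 0
df['label'] = (df['rating'] > 3).astype(int)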

I'm not good at Python or machine learning, so I got stuck on all the samples I've googled.

Please explain how to extract features in this sample case, how to train a model, and so on, or point me to a good tutorial for this case (I'm not able to translate the sklearn tutorial to my case).

Upvotes: 3

Views: 3002

Answers (2)

Bunny_Ross

Reputation: 1468

You can do this extremely easily in scikit.

Let's say you have X and y data:

X = ['the food was really delicious', 'the food was really terrible']
y = [5,2]

Using CountVectorizer you can convert the data into numbers in 2 lines of code:

from sklearn.feature_extraction.text import CountVectorizer
x_data = CountVectorizer().fit_transform(X)    

This fully converts your data into counts, which you can then feed into whichever algorithm you want:

from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier().fit(x_data, y)
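If you want to score a new review later, keep the fitted vectorizer around and reuse it with transform (not fit_transform), so the new text is mapped to the same columns. A rough sketch (n_neighbors=1 only because this toy set has just two samples):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier

X = ['the food was really delicious', 'the food was really terrible']
y = [5, 2]

vec = CountVectorizer()
x_data = vec.fit_transform(X)                   # learn vocabulary and count words
clf = KNeighborsClassifier(n_neighbors=1).fit(x_data, y)

new = vec.transform(['really delicious food'])  # reuse the same vocabulary
print(clf.predict(new))                         # -> [5]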

Upvotes: 5

serv-inc

Reputation: 38177

There are roughly two general steps, each of which could be explained in great detail.

Feature Extraction

First, you need to determine which features to use. This is one of the main tasks, and is up to you. The standard approach is a bag-of-words model. This counts the occurrence of each word in each text. It is

quite simplistic but surprisingly useful in practice

There are also specialized tools that do a tf-idf analysis for you, for example Sally.
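Within scikit-learn itself, the same weighting is available as TfidfVectorizer (CountVectorizer gives plain counts); a minimal sketch on made-up reviews:

from sklearn.feature_extraction.text import TfidfVectorizer

texts = ['that was awful', 'that was great']   # stand-ins for your review column

vec = TfidfVectorizer()
X = vec.fit_transform(texts)   # sparse matrix: one row per review, one column per word
print(vec.vocabulary_)         # word -> column index
print(X.toarray())             # tf-idf weights instead of raw counts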

Let's assume you want to do this in Python using scikit-learn, and that the data is already available as a class Review(object) with text and rating attributes. From the text, you need to extract features.

Example:

def extract(review):
    '''extracts bag-of-words features (word counts) from a review'''
    result = {}
    for word in review.text.split():
        # increment the count if the word was seen before, otherwise start at 1
        if word in result:
            result[word] += 1
        else:
            result[word] = 1
    return result

This gives you a count of all words in the text (there is also a standard-library class, Counter, which could do that for you; see the sketch after the next block). You could combine these counts to form a feature matrix X. (This code could be heavily optimized.)

X = []
y = []
words = []
# build an index of all words that occur in any review
for review in reviews:
    for word in extract(review):
        if word not in words:
            words.append(word)
# create one feature vector (and label) per review
for review in reviews:
    feature_vector = [0] * len(words)
    y.append(review.rating)
    for word, count in extract(review).items():
        feature_vector[words.index(word)] = count
    X.append(feature_vector)
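As an aside, the standard library's collections.Counter could replace the hand-written extract above; since Counter is a dict subclass, the loop above keeps working unchanged. A minimal sketch:

from collections import Counter

def extract(review):
    '''bag-of-words counts via collections.Counter'''
    return Counter(review.text.split())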

Classification

Now that you have the feature vectors, you need to decide which classifier to use. One of the easiest is k-nearest-neighbors.

from sklearn import neighbors
# train_test_split lives in sklearn.model_selection in current scikit-learn
# (the old sklearn.cross_validation module has been removed)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
knn = neighbors.KNeighborsClassifier()
knn.fit(X_train, y_train)
knn.predict(X_test)

Compare this to y_test.
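One concrete way to do that comparison is the built-in accuracy metric; a small sketch:

from sklearn.metrics import accuracy_score

y_pred = knn.predict(X_test)
print(accuracy_score(y_test, y_pred))   # fraction of test reviews predicted correctly
# equivalently: knn.score(X_test, y_test)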

Example from comments (slightly edited)

Let's consider an example of two reviews:

  1. that was awful | rating 1;
  2. that was great | rating 5.

Two dicts are created: {'that': 1, 'was': 1, 'awful': 1} and {'that': 1, 'was': 1, 'great': 1}. What should the X and y vectors look like in that case?

Firstly, your words might be ['that', 'was', 'awful', 'great'].

Then, you might get

X = [[1, 1, 1, 0],
     [1, 1, 0, 1]]
y = [1, 5]
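For comparison, scikit-learn's CountVectorizer builds the same kind of matrix automatically; note that it orders the columns alphabetically, so the layout differs from the hand-built words list above. A small sketch:

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
X = vec.fit_transform(['that was awful', 'that was great'])
print(vec.vocabulary_)   # word -> column index (awful=0, great=1, that=2, was=3)
print(X.toarray())       # [[1 0 1 1]
                         #  [0 1 1 1]]
y = [1, 5]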

Upvotes: 2
