tumbleweed

Reputation: 4660

How to use labels to classify text with scikit-learn?

I have an NLP task (text classification). I extracted some bigrams like this:

training_data = [[('this', 'is'), ('is', 'a'), ('a', 'text')],
        [('and', 'one'), ('one', 'more')]]

Then I could use a vectorizer like this:

from sklearn.feature_extraction import FeatureHasher

fh = FeatureHasher(input_type='string')

X = fh.transform(((' '.join(x) for x in sample)
                  for sample in training_data))
print(X.toarray())

[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]

This is how the SVM algorithm can be used to classify:

from sklearn import svm
s = svm.SVC()
labels = ['HAM', 'SPAM']
s.fit(training_data, labels)

How can I use the labels in the above bigrams (i.e. training_data) in order to classify? For example:

data = [[('this', 'is'), ('is', 'a'), ('a', 'text'), 'SPAM'], 
[('and', 'one'), ('one', 'more'), 'HAM']]

Upvotes: 2

Views: 3146

Answers (1)

user823743

Reputation: 2172

In the above code, assuming that we have a feature vector named doc, if you write:

result = s.predict(doc)

result should be either 0 or 1, so the prediction result is numeric. Thus, it's better to assign the labels accordingly. However, if you still want to use string labels, you can simply decide that, for example, label 'a' is equivalent to 1 and 'b' to 0. I know that, unlike scikit-learn, in nltk the labels are strings by default, but is there any real difference?
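To make this concrete, here is a minimal sketch (the feature vectors and the 0 = SPAM / 1 = HAM convention are made up for illustration) showing that predict returns numeric labels, which you can map back to strings yourself:

```python
from sklearn import svm

# Two toy feature vectors with numeric labels: 0 = SPAM, 1 = HAM
X = [[1, 1, 1, 0, 0],
     [0, 0, 0, 1, 1]]
labels = [0, 1]

s = svm.SVC()
s.fit(X, labels)

doc = [[1, 1, 0, 0, 0]]      # a new, unseen feature vector
result = s.predict(doc)      # numeric prediction, e.g. array([0])

# Map the numeric class back to a string label if you want one
names = {0: 'SPAM', 1: 'HAM'}
print(names[result[0]])
```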

Edit 1: I can see from your first edit that you might have a misconception about feature vectors and their labels.

First of all, the type of label you assign doesn't affect the outcome: if you label one class spam and the other non-spam, the classifier doesn't automagically detect spam; the classification depends on your feature vectors, and the class label is only there for comparison's sake. So if you say "I assume that in my code 0 represents SPAM and 1 represents HAM" and you label your data accordingly, that works and it's enough.

The second issue is that I'm not sure you know what a bigram feature vector should look like, because of the way you represented your data in the code below:

data = [[('this', 'is'), ('is', 'a'), ('a', 'text'), 'SPAM'], 
[('and', 'one'), ('one', 'more'), 'HAM']] 

A bigram feature vector should contain all possible features present in your data set, and then, to represent each document, you assign a 1 to every feature present in that document and a 0 to the rest. As an example, I will rewrite your above example in the correct form:

Features:   'this is'  'is a'   'a text'  'and one'   'one more'     Label
doc 1:         1         1         1          0           0           SPAM (or as I explained 0)
doc 2:         0         0         0          1           1           HAM (or as I explained 1)

Now, we can write the feature vectors of the above documents in the following form:

data = [([1,1,1,0,0], 0), ([0,0,0,1,1], 1)]

Notice that the label for the first document is 0 (or SPAM) and for the second document it's 1 (or HAM). I tried to make the example as clear as possible. When using scikit-learn you might prefer numpy arrays over lists, but the idea is the same. Reading this question here about bigrams along with my answer might help you. Let me know if you have further questions, but try to think the above example through first.

Edit 2: Just in case you are wondering how to fill the labels variable in your code: for each document (converted to its feature-vector representation) you have to have a corresponding label. In your code, the array X contains the feature vectors, so labels must hold, at each position, the label of the feature vector at the same position in X. Thus, assuming that you have 100 documents (50 SPAM, or 0, and 50 HAM, or 1), your labels should look like this:

labels = [0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,...]

but this depends on how you have ordered your data. Some classifiers take the labels grouped like above, and some take 0s and 1s interleaved, such as:

labels = [0,1,0,1,0,1, ...] 

In svm.SVC() you can use the latter; however, make sure that your feature vectors are also interleaved and correspond to the right labels.
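As a small sketch of the interleaved ordering (the four two-element feature vectors are made up), the only thing that matters is that each label sits at the same index as its feature vector:

```python
from sklearn import svm

# Feature vectors interleaved SPAM, HAM, SPAM, HAM ...
X = [[1, 0], [0, 1], [1, 0], [0, 1]]
labels = [0, 1, 0, 1]   # same order as the rows of X

s = svm.SVC()
s.fit(X, labels)

# A vector identical to the SPAM (0) training rows
print(s.predict([[1, 0]]))
```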

Upvotes: 3
