Reputation: 1560
I'm building a classifier using NLTK and its scikit-learn wrapper (SklearnClassifier).
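from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import LinearSVC
classifier = SklearnClassifier(LinearSVC(), int, True)  # dtype=int, sparse=True
classifier.train(train_set)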
When I used only unigrams and built a feature set such as:
{"Cristiano": True, "Ronaldo": True}
everything was fine. But when I try to add collocations, there is a problem. The feature set now mixes string keys with a tuple key:
{"Cristiano": True, "Ronaldo": True, ("Cristiano", "Ronaldo"): True}
Then I get this error:
feature_names.sort()
TypeError: unorderable types: tuple() < str()
How do I build a feature set for the NLTK scikit-learn wrapper that uses both unigrams and bigrams?
Upvotes: 1
Views: 3889
Reputation: 83177
You could use CountVectorizer from scikit-learn to generate the ngrams.
Demo:
import sklearn.feature_extraction.text
ngram_size = 1
train_set = ['Cristiano plays football', 'Ronaldo like football too']
vectorizer = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size,ngram_size))
vectorizer.fit(train_set) # build the n-gram vocabulary
ngram = vectorizer.transform(train_set) # sparse matrix of n-gram counts per document
print('ngram: {0}\n'.format(ngram))
print('ngram.shape: {0}'.format(ngram.shape))
print('vectorizer.vocabulary_: {0}'.format(vectorizer.vocabulary_))
outputs:
ngram: (0, 0) 1
(0, 1) 1
(0, 3) 1
(1, 1) 1
(1, 2) 1
(1, 4) 1
(1, 5) 1
ngram.shape: (2, 6)
vectorizer.vocabulary_: {u'cristiano': 0, u'plays': 3, u'like': 2,
u'ronaldo': 4, u'football': 1, u'too': 5}
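The demo above sets ngram_size = 1, so it only produces unigrams. Since the question asks for unigrams and bigrams together, here is a minimal sketch of the same demo with ngram_range=(1, 2). The variable name X is only illustrative, and it assumes the default CountVectorizer tokenizer (lowercasing, tokens of two or more characters).

```python
from sklearn.feature_extraction.text import CountVectorizer

train_set = ['Cristiano plays football', 'Ronaldo like football too']

# ngram_range=(1, 2) asks for unigrams and bigrams in a single vocabulary
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(train_set)  # sparse document-term count matrix

print('X.shape: {0}'.format(X.shape))
print('features: {0}'.format(sorted(vectorizer.vocabulary_)))
```

With this input the vocabulary should hold the six unigrams plus five bigrams such as 'cristiano plays'. The resulting matrix can be passed directly to a scikit-learn estimator such as LinearSVC, without going through the NLTK wrapper.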
Upvotes: 2
Reputation: 11
If you want to keep using the NLTK wrapper, you can simply do the following before training the classifier:
classifier._vectorizer.sort = False
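Note that _vectorizer is a private attribute of the wrapper, so this relies on an implementation detail. Another option, sketched below, is to avoid the mixed key types altogether by turning every tuple key into a string before training, so the feature names can be sorted as usual. The helper name stringify_keys and the space separator are assumptions made for illustration, not part of NLTK.

```python
def stringify_keys(features, sep=" "):
    """Return a copy of an NLTK-style feature dict where tuple keys
    (e.g. bigrams) are joined into plain strings.

    Hypothetical helper, not an NLTK function.
    """
    return {
        (sep.join(name) if isinstance(name, tuple) else name): value
        for name, value in features.items()
    }


features = {"Cristiano": True, "Ronaldo": True, ("Cristiano", "Ronaldo"): True}
flat = stringify_keys(features)

# All keys are now strings, so sorting the feature names no longer fails
print(sorted(flat))
```

Each (features, label) pair in train_set would be converted this way before calling classifier.train(train_set). If a unigram could itself contain the separator, pick a separator that cannot collide with it.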
Upvotes: 0