Nltk Sklearn Unigram + Bigram

I'm building a classifier using NLTK and its scikit-learn wrapper (SklearnClassifier).

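from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import LinearSVC

classifier = SklearnClassifier(LinearSVC(), int, True)  # dtype=int, sparse=True
classifier.train(train_set)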

When I used only unigrams and built a feature set such as:

{"Cristiano": True, "Ronaldo": True}

everything was fine. But when I try to add collocations there is a problem. The feature set now mixes string keys and tuple keys:

{"Cristiano": True, "Ronaldo": True, ("Cristiano", "Ronaldo"): True}

Then I receive this error:

feature_names.sort()
TypeError: unorderable types: tuple() < str()

How do I build a feature set for the NLTK sklearn wrapper that uses both unigrams and bigrams?

Upvotes: 1

Answers (2)

Franck Dernoncourt

Reputation: 83177

You could use CountVectorizer from scikit-learn to generate the ngrams.

Demo (unigrams only; setting ngram_range=(1, 2) would include bigrams as well):

import sklearn.feature_extraction.text

ngram_size = 1
train_set = ['Cristiano plays football', 'Ronaldo like football too']

vectorizer = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size,ngram_size))
vectorizer.fit(train_set) # build ngram dictionary
ngram = vectorizer.transform(train_set) # get ngram
print('ngram: {0}\n'.format(ngram))
print('ngram.shape: {0}'.format(ngram.shape))
print('vectorizer.vocabulary_: {0}'.format(vectorizer.vocabulary_))

outputs:

ngram:   (0, 0) 1
  (0, 1)    1
  (0, 3)    1
  (1, 1)    1
  (1, 2)    1
  (1, 4)    1
  (1, 5)    1

ngram.shape: (2, 6)
vectorizer.vocabulary_: {u'cristiano': 0, u'plays': 3, u'like': 2, 
                         u'ronaldo': 4, u'football': 1, u'too': 5}
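
Since the question asks for unigrams and bigrams together, here is a minimal sketch of the same demo with ngram_range=(1, 2). It reuses the two example sentences above; the variable names are only illustrative, and it does not train a classifier.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_set = ['Cristiano plays football', 'Ronaldo like football too']

# ngram_range=(1, 2) puts unigrams and bigrams into one vocabulary
vectorizer = CountVectorizer(ngram_range=(1, 2))
features = vectorizer.fit_transform(train_set)  # sparse document-term matrix

# Bigrams are stored as space-joined strings, so every feature name is a str
print(sorted(vectorizer.vocabulary_))
print(features.shape)
```

The resulting matrix can be passed directly to a scikit-learn estimator such as LinearSVC, without going through the NLTK wrapper.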

Upvotes: 2

Alex B

Reputation: 11

If you want to keep using the NLTK wrapper, you can simply do the following before training the classifier, so that the feature names are never sorted:

classifier._vectorizer.sort = False
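
An alternative that avoids touching a private attribute is to make every feature key a string, so that sorting never compares a tuple with a str. The sketch below is one way to do that; build_features is a hypothetical helper, not part of NLTK, and it assumes the tokens themselves contain no spaces.

```python
def build_features(tokens):
    """Return a feature dict whose keys are all strings."""
    # Unigram features: one key per token
    features = {token: True for token in tokens}
    # Bigram features: join each adjacent pair into a single string key
    for first, second in zip(tokens, tokens[1:]):
        features[first + " " + second] = True
    return features


features = build_features(["Cristiano", "Ronaldo"])
print(sorted(features))  # sorting works because every key is a str
```

Feature sets built this way can be paired with labels and passed to classifier.train as before.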

Upvotes: 0
