
Reputation: 4640

How to vectorize bigrams with the hashing-trick in scikit-learn?

I have some bigrams, let's say: [('word','word'),('word','word'),...,('word','word')]. How can I use scikit-learn's HashingVectorizer to create a feature vector that can then be passed to a classification algorithm such as SVC, Naive Bayes, or any other classifier?

Upvotes: 4

Views: 4864

Answers (2)

Fred Foo

Reputation: 363627

Since you've already extracted the bigrams yourself, you can vectorize using a FeatureHasher. The main thing you need to do is squash the bigrams to strings. E.g.,

>>> data = [[('this', 'is'), ('is', 'a'), ('a', 'text')],
...         [('and', 'one'), ('one', 'more')]]
>>> from sklearn.feature_extraction import FeatureHasher
>>> fh = FeatureHasher(input_type='string')
>>> X = fh.transform(((' '.join(x) for x in sample) for sample in data))
>>> X
<2x1048576 sparse matrix of type '<type 'numpy.float64'>'
    with 5 stored elements in Compressed Sparse Row format>
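
To pass the hashed bigrams on to a classifier, something like the following should work. It is only a sketch, assuming Python 3 and a reasonably recent scikit-learn: the toy documents, the labels 'a'/'b', the helper bigrams_to_strings and the choice of LinearSVC are all made up for illustration.

```python
from sklearn.feature_extraction import FeatureHasher
from sklearn.svm import LinearSVC

# Hypothetical toy data: each document is a list of pre-extracted bigrams.
docs = [
    [('this', 'is'), ('is', 'a'), ('a', 'text')],
    [('and', 'one'), ('one', 'more')],
    [('this', 'is'), ('is', 'more'), ('more', 'text')],
    [('and', 'one'), ('one', 'again')],
]
labels = ['a', 'b', 'a', 'b']  # made-up class labels


def bigrams_to_strings(samples):
    """Join each bigram tuple into a single 'w1 w2' string feature."""
    return [[' '.join(bigram) for bigram in sample] for sample in samples]


# The hasher is stateless, so the same object can transform train and test data.
hasher = FeatureHasher(n_features=2 ** 18, input_type='string')
X = hasher.transform(bigrams_to_strings(docs))

clf = LinearSVC()
clf.fit(X, labels)

# New documents must go through the same bigram-to-string step.
X_new = hasher.transform(bigrams_to_strings([[('this', 'is'), ('is', 'text')]]))
predicted = clf.predict(X_new)
print(predicted)
```

Because the hasher never stores a vocabulary, n_features has to be the same for training and prediction; a smaller value saves memory at the cost of more hash collisions.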

Upvotes: 3


Reputation: 122112

Firstly, you MUST understand what the different vectorizers are doing. Most vectorizers are based on the bag-of-words approach, where the tokens of each document are mapped onto a matrix.

From the sklearn documentation, CountVectorizer and HashingVectorizer:

Convert a collection of text documents to a matrix of token counts

For instance, these sentences

The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced no evidence that any irregularities took place .

The jury further said in term-end presentments that the City Executive Committee , which had over-all charge of the election , `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted .

with this rough vectorizer:

from collections import Counter
from itertools import chain
from string import punctuation

from nltk.corpus import brown, stopwords

# Let's say the training/testing data is a list of tokenized sentences
sentences = brown.sents()[:2]

# Extract the content words as features, i.e. columns.
vocabulary = list(chain(*sentences))
stops = stopwords.words('english') + list(punctuation)
vocab_nostop = [i.lower() for i in vocabulary if i not in stops]

# Create a matrix from the sentences
matrix = [Counter([w for w in words if w in vocab_nostop]) for words in sentences]

print(matrix)

would become:

[Counter({u"''": 1, u'``': 1, u'said': 1, u'took': 1, u'primary': 1, u'evidence': 1, u'produced': 1, u'investigation': 1, u'place': 1, u'election': 1, u'irregularities': 1, u'recent': 1}), Counter({u'the': 6, u'election': 2, u'presentments': 1, u'``': 1, u'said': 1, u'jury': 1, u'conducted': 1, u"''": 1, u'deserves': 1, u'charge': 1, u'over-all': 1, u'praise': 1, u'manner': 1, u'term-end': 1, u'thanks': 1})]

This might be rather inefficient for a very large dataset, so the sklearn devs built more efficient code. One of the most important features of sklearn is that you don't even need to load the dataset into memory before vectorizing it.
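
To make the hashing trick from the question concrete, here is a rough pure-Python sketch of the idea (Python 3). It is not scikit-learn's implementation: sklearn hashes with MurmurHash3, whereas this uses md5 as a stand-in, and the names hash_token and hash_vectorize are made up for illustration. The point is that a token's column is computed from its hash, so no vocabulary has to be kept in memory.

```python
import hashlib
from collections import Counter


def hash_token(token, n_features):
    """Map a token to a (column index, sign) pair without any vocabulary."""
    digest = hashlib.md5(token.encode('utf8')).digest()
    # Interpret the first 4 bytes as a signed 32-bit integer.
    value = int.from_bytes(digest[:4], 'little', signed=True)
    index = abs(value) % n_features
    sign = 1 if value >= 0 else -1
    return index, sign


def hash_vectorize(tokens, n_features=16):
    """Return one sparse row as {column index: value}."""
    row = Counter()
    for token in tokens:
        index, sign = hash_token(token, n_features)
        row[index] += sign
    # Drop columns whose colliding tokens cancelled each other out.
    return {i: v for i, v in row.items() if v != 0}


row = hash_vectorize(['the', 'jury', 'said', 'the', 'election'])
print(row)
```

The alternating sign is why hashed vectors can contain negative values, and the fixed n_features is why two different tokens can end up in the same column.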

Since it's unclear what your task is, I think you're looking for a general use case. Let's say you're using it for language ID.

Let's say that your input file for the training data is train.txt:

Pošto je EULEX obećao da će obaviti istragu o prošlosedmičnom izbijanju nasilja na sjeveru Kosova, taj incident predstavlja još jedan ispit kapaciteta misije da doprinese jačanju vladavine prava.
De todas as provações que teve de suplantar ao longo da vida, qual foi a mais difícil? O início. Qualquer começo apresenta dificuldades que parecem intransponíveis. Mas tive sempre a minha mãe do meu lado. Foi ela quem me ajudou a encontrar forças para enfrentar as situações mais decepcionantes, negativas, as que me punham mesmo furiosa.
Al parecer, Andrea Guasch pone que una relación a distancia es muy difícil de llevar como excusa. Algo con lo que, por lo visto, Alex Lequio no está nada de acuerdo. ¿O es que más bien ya ha conseguido la fama que andaba buscando?
Vo väčšine golfových rezortov ide o veľký komplex niekoľkých ihrísk blízko pri sebe spojených s hotelmi a ďalšími možnosťami trávenia voľného času – nie vždy sú manželky či deti nadšenými golfistami, a tak potrebujú iný druh vyžitia. Zaujímavé kombinácie ponúkajú aj rakúske, švajčiarske či talianske Alpy, kde sa dá v zime lyžovať a v lete hrať golf pod vysokými alpskými končiarmi.

And your corresponding labels are Bosnian, Portuguese, Spanish and Slovak.


Here's one way to use the CountVectorizer and the Naive Bayes classifier. The following example is from https://github.com/alvations/bayesline of the DSL shared task.

Let's start with the vectorizer. First, the vectorizer reads the input file, learns its vocabulary (i.e. the features) from the training set, and converts the training set into a vectorized matrix:

import codecs

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

trainfile = 'train.txt'
testfile = 'test.txt'

# Vectorizing data.
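word_vectorizer = CountVectorizer(analyzer='word')
trainset = word_vectorizer.fit_transform(codecs.open(trainfile,'r','utf8'))
# One label per line of train.txt: Bosnian, Portuguese, Spanish, Slovak.
tags = ['bs','pt','es','sk']
# Newer scikit-learn releases rename this method to get_feature_names_out().
print(word_vectorizer.get_feature_names())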


[u'acuerdo', u'aj', u'ajudou', u'al', u'alex', u'algo', u'alpsk\xfdmi', u'alpy', u'andaba', u'andrea', u'ao', u'apresenta', u'as', u'bien', u'bl\xedzko', u'buscando', u'come\xe7o', u'como', u'con', u'conseguido', u'da', u'de', u'decepcionantes', u'deti', u'dificuldades', u'dif\xedcil', u'distancia', u'do', u'doprinese', u'druh', u'd\xe1', u'ela', u'encontrar', u'enfrentar', u'es', u'est\xe1', u'eulex', u'excusa', u'fama', u'foi', u'for\xe7as', u'furiosa', u'golf', u'golfistami', u'golfov\xfdch', u'guasch', u'ha', u'hotelmi', u'hra\u0165', u'ide', u'ihr\xedsk', u'incident', u'intranspon\xedveis', u'in\xedcio', u'in\xfd', u'ispit', u'istragu', u'izbijanju', u'ja\u010danju', u'je', u'jedan', u'jo\u0161', u'kapaciteta', u'kde', u'kombin\xe1cie', u'komplex', u'kon\u010diarmi', u'kosova', u'la', u'lado', u'lequio', u'lete', u'llevar', u'lo', u'longo', u'ly\u017eova\u0165', u'mais', u'man\u017eelky', u'mas', u'me', u'mesmo', u'meu', u'minha', u'misije', u'mo\u017enos\u0165ami', u'muy', u'm\xe1s', u'm\xe3e', u'na', u'nada', u'nad\u0161en\xfdmi', u'nasilja', u'negativas', u'nie', u'nieko\u013ek\xfdch', u'no', u'obaviti', u'obe\u0107ao', u'para', u'parecem', u'parecer', u'pod', u'pone', u'pon\xfakaj\xfa', u'por', u'potrebuj\xfa', u'po\u0161to', u'prava', u'predstavlja', u'pri', u'prova\xe7\xf5es', u'pro\u0161losedmi\u010dnom', u'punham', u'qual', u'qualquer', u'que', u'quem', u'rak\xfaske', u'relaci\xf3n', u'rezortov', u'sa', u'sebe', u'sempre', u'situa\xe7\xf5es', u'sjeveru', u'spojen\xfdch', u'suplantar', u's\xfa', u'taj', u'tak', u'talianske', u'teve', u'tive', u'todas', u'tr\xe1venia', u'una', u've\u013ek\xfd', u'vida', u'visto', u'vladavine', u'vo', u'vo\u013en\xe9ho', u'vysok\xfdmi', u'vy\u017eitia', u'v\xe4\u010d\u0161ine', u'v\u017edy', u'ya', u'zauj\xedmav\xe9', u'zime', u'\u0107e', u'\u010dasu', u'\u010di', u'\u010fal\u0161\xedmi', u'\u0161vaj\u010diarske']

Let's say your test documents are in test.txt, and their labels are Spanish (es) and Portuguese (pt):

Por ello, ha insistido en que Europa tiene que darle un toque de atención porque Portugal esta incumpliendo la directiva del establecimiento del peaje
Estima-se que o mercado homossexual só na Cidade do México movimente cerca de oito mil milhões de dólares, aproximadamente seis mil milhões de euros

Now, you can label the test documents with the trained classifier as such:

import codecs

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

trainfile = 'train.txt'
testfile = 'test.txt'

# Vectorizing data.
word_vectorizer = CountVectorizer(analyzer='word')
trainset = word_vectorizer.fit_transform(codecs.open(trainfile,'r','utf8'))
# One label per line of train.txt: Bosnian, Portuguese, Spanish, Slovak.
tags = ['bs','pt','es','sk']

# Training NB
mnb = MultinomialNB()
mnb.fit(trainset, tags)

# Tagging the documents
testset = word_vectorizer.transform(codecs.open(testfile,'r','utf8'))
results = mnb.predict(testset)

print(results)


['es' 'pt']

For more information on text classification, you might find this NLTK-related question/answer useful: nltk NaiveBayesClassifier training for sentiment analysis

To use the HashingVectorizer, note that it can produce negative feature values, and the MultinomialNB classifier does not accept negative values, so you would have to use another classifier, like so:

import codecs

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import Perceptron

trainfile = 'train.txt'
testfile = 'test.txt'

# Vectorizing data.
word_vectorizer = HashingVectorizer(analyzer='word')
trainset = word_vectorizer.fit_transform(codecs.open(trainfile,'r','utf8'))
# One label per line of train.txt: Bosnian, Portuguese, Spanish, Slovak.
tags = ['bs','pt','es','sk']

# Training Perceptron
pct = Perceptron(max_iter=100)  # this parameter was called n_iter in older scikit-learn releases
pct.fit(trainset, tags)

# Tagging the documents
testset = word_vectorizer.transform(codecs.open(testfile,'r','utf8'))
results = pct.predict(testset)

print(results)


['es' 'es']

But do note that the perceptron's results are worse in this small example. Different classifiers suit different tasks, different features suit different vector representations, and different classifiers accept different kinds of vectors.

There is no perfect model, just better or worse.
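
If you would rather keep Naive Bayes and hash word bigrams straight from raw text, the following sketch might help. It assumes Python 3 and a scikit-learn version in which HashingVectorizer has the alternate_sign parameter (older releases used a different option for this); the four training sentences are toy data invented for this example, not taken from train.txt.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up toy data, two Spanish and two Portuguese sentences.
train_docs = [
    'la casa es muy grande',
    'el perro es muy bueno',
    'a casa é muito grande',
    'o cão é muito bom',
]
train_tags = ['es', 'es', 'pt', 'pt']

# ngram_range=(1, 2) hashes unigrams and word bigrams;
# alternate_sign=False keeps every feature value non-negative,
# which is what MultinomialNB requires.
vectorizer = HashingVectorizer(analyzer='word', ngram_range=(1, 2),
                               alternate_sign=False)
X_train = vectorizer.transform(train_docs)

mnb = MultinomialNB()
mnb.fit(X_train, train_tags)

X_test = vectorizer.transform(['el gato es muy grande'])
predicted = mnb.predict(X_test)
print(predicted)
```

Use ngram_range=(2, 2) instead if you want bigrams only, as in the original question.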

Upvotes: 6
