I'm using CountVectorizer to tokenize text, and I want to add my own stop words. Why doesn't this work? The word 'de' should not appear in the final print.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(ngram_range=(1,1),stop_words=frozenset([u'de']))
word_tokenizer = vectorizer.build_tokenizer()
print (word_tokenizer(u'Isto é um teste de qualquer coisa.'))
[u'Isto', u'um', u'teste', u'de', u'qualquer', u'coisa']
Upvotes: 3
Views: 4722
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(ngram_range=(1,1), stop_words=frozenset([u'de']))
vectorizer.fit_transform([u'Isto é um teste de qualquer coisa.'])

In [7]: vectorizer.vocabulary_
Out[7]: {u'coisa': 0, u'isto': 1, u'qualquer': 2, u'teste': 3, u'um': 4}
You can see that u'de' is not in the computed vocabulary (note that vocabulary_ only exists after the vectorizer has been fitted).
The build_tokenizer method just tokenizes your string; removing the stop words is supposed to happen afterwards, when the vectorizer is fitted.
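If you want a single callable that applies the full analysis pipeline (preprocessing, tokenization, and stop-word removal) without fitting, CountVectorizer also exposes build_analyzer. A minimal sketch, assuming default settings (lowercasing on, the default token_pattern that drops single-character tokens like 'é'):

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 1), stop_words=frozenset([u'de']))

# Unlike build_tokenizer, build_analyzer returns a callable that also
# lowercases the text and filters out the stop words.
analyzer = vectorizer.build_analyzer()
print(analyzer(u'Isto é um teste de qualquer coisa.'))
# ['isto', 'um', 'teste', 'qualquer', 'coisa']
```

Here 'de' is gone because the analyzer applies the stop-word list, and 'é' is gone because the default token pattern only keeps tokens of two or more word characters.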
From the source code of CountVectorizer:
def build_tokenizer(self):
    """Return a function that splits a string into a sequence of tokens"""
    if self.tokenizer is not None:
        return self.tokenizer
    token_pattern = re.compile(self.token_pattern)
    return lambda doc: token_pattern.findall(doc)
A solution to your problem could be:
vectorizer = CountVectorizer(ngram_range=(1,1),stop_words=frozenset([u'de']))
sentence = [u'Isto é um teste de qualquer coisa.']
tokenized = vectorizer.fit_transform(sentence)
result = vectorizer.inverse_transform(tokenized)
In [12]: result
Out[12]:
[array([u'isto', u'um', u'teste', u'qualquer', u'coisa'],
dtype='<U8')]
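Equivalently, you can check the fitted vocabulary and the count matrix directly instead of going through inverse_transform. A minimal sketch, assuming a recent scikit-learn:

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 1), stop_words=frozenset([u'de']))
counts = vectorizer.fit_transform([u'Isto é um teste de qualquer coisa.'])

# The learned vocabulary (and therefore the columns of the count matrix)
# never contains the stop word 'de'.
print(sorted(vectorizer.vocabulary_))
# ['coisa', 'isto', 'qualquer', 'teste', 'um']
print(counts.toarray())
# [[1 1 1 1 1]]
```

Each remaining word occurs exactly once in the sentence, hence the row of ones; 'de' has no column at all.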
Upvotes: 2