Reputation: 469
I'd like to create a CountVectorizer in scikit-learn based on a corpus of text and then add more text to the CountVectorizer later (adding to the original dictionary).
If I use transform()
, it does maintain the original vocabulary, but adds no new words. If I use fit_transform()
, it just regenerates the vocabulary from scratch. See below:
In [2]: count_vect = CountVectorizer()
In [3]: count_vect.fit_transform(["This is a test"])
Out[3]:
<1x3 sparse matrix of type '<type 'numpy.int64'>'
with 3 stored elements in Compressed Sparse Row format>
In [4]: count_vect.vocabulary_
Out[4]: {u'is': 0, u'test': 1, u'this': 2}
In [5]: count_vect.transform(["This not is a test"])
Out[5]:
<1x3 sparse matrix of type '<type 'numpy.int64'>'
with 3 stored elements in Compressed Sparse Row format>
In [6]: count_vect.vocabulary_
Out[6]: {u'is': 0, u'test': 1, u'this': 2}
In [7]: count_vect.fit_transform(["This not is a test"])
Out[7]:
<1x4 sparse matrix of type '<type 'numpy.int64'>'
with 4 stored elements in Compressed Sparse Row format>
In [8]: count_vect.vocabulary_
Out[8]: {u'is': 0, u'not': 1, u'test': 2, u'this': 3}
I'd like the equivalent of an update()
function. I'd like it to work something like this:
In [2]: count_vect = CountVectorizer()
In [3]: count_vect.fit_transform(["This is a test"])
Out[3]:
<1x3 sparse matrix of type '<type 'numpy.int64'>'
with 3 stored elements in Compressed Sparse Row format>
In [4]: count_vect.vocabulary_
Out[4]: {u'is': 0, u'test': 1, u'this': 2}
In [5]: count_vect.update(["This not is a test"])
Out[5]:
<1x3 sparse matrix of type '<type 'numpy.int64'>'
with 4 stored elements in Compressed Sparse Row format>
In [6]: count_vect.vocabulary_
Out[6]: {u'is': 0, u'not': 1, u'test': 2, u'this': 3}
Is there a way to do this?
Upvotes: 3
Views: 2452
Reputation: 5355
The algorithms implemented in scikit-learn
are designed to be fit on all the data at once, which is necessary for most ML algorithms (though interesting not the application that you describe), so there is no update
functionality.
There is a way to get to what you want by thinking of it slightly differently though, see the following code
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
count_vect = CountVectorizer()
count_vect.fit_transform(["This is a test"])
print count_vect.vocabulary_
count_vect.fit_transform(["This is a test", "This is not a test"])
print count_vect.vocabulary_
Which outputs
{u'this': 2, u'test': 1, u'is': 0}
{u'this': 3, u'test': 2, u'is': 0, u'not': 1}
Upvotes: 5