Reputation: 171
I use gensim to build a dictionary from a collection of documents. Each document is a list of tokens. This is my code:
def constructModel(self, docTokens):
    """Given document tokens, constructs the tf-idf and similarity models."""
    # construct dictionary for the BOW (vector-space) model:
    # Dictionary = a mapping between words and their integer ids
    #            = collection of (word_index, word_string) pairs
    self.dictionary = corpora.Dictionary(docTokens)
    # prune dictionary: remove words that appear too infrequently or too frequently
    print "dictionary size before filter_extremes:", self.dictionary  # len(self.dictionary.values())
    #self.dictionary.filter_extremes(no_below=1, no_above=0.9, keep_n=100000)
    #self.dictionary.compactify()
    print "dictionary size after filter_extremes:", self.dictionary
    # construct the corpus bow vectors; bow vector = collection of (word_id, word_frequency) pairs
    corpus_bow = [self.dictionary.doc2bow(doc) for doc in docTokens]
    # construct the tf-idf model
    self.model = models.TfidfModel(corpus_bow, normalize=True)
    # first transform each raw bow vector in the corpus to the tfidf model's vector space
    corpus_tfidf = self.model[corpus_bow]
    # construct the term-document index
    self.similarityModel = similarities.MatrixSimilarity(corpus_tfidf)
My question is how to add a new document (a list of tokens) to this dictionary and update it. I searched the gensim documentation but didn't find a solution.
Upvotes: 4
Views: 6010
Reputation: 4179
You can just use keyedvectors from gensim.models.keyedvectors. They are very easy to use.
from gensim.models.keyedvectors import WordEmbeddingsKeyedVectors
w2v = WordEmbeddingsKeyedVectors(50) # 50 = vec length
w2v.add(new_words, their_new_vecs)
And if you have already built a model using gensim.models.Word2Vec, you can just do this. Suppose I want to add the token <UNK> with a random vector:
model.wv["<UNK>"] = np.random.rand(100) # 100 is the vectors length
The complete example would be like this:
import numpy as np
import gensim.downloader as api
from gensim.models import Word2Vec
dataset = api.load("text8") # load dataset as iterable
model = Word2Vec(dataset)
model.wv["<UNK>"] = np.random.rand(100)
Upvotes: 0
Reputation: 7656
You can use the add_documents method:
from gensim import corpora
text = [["aaa", "aaa"]]
dictionary = corpora.Dictionary(text)
dictionary.add_documents([['bbb','bbb']])
print(dictionary)
After running the code above, you will get this:
Dictionary(2 unique tokens: ['aaa', 'bbb'])
Read the documentation for more details.
Upvotes: 3
Reputation: 724
The gensim documentation describes how to do this.
The way to do it is to create another dictionary from the new documents and then merge it into the first one.
from gensim import corpora
dict1 = corpora.Dictionary(firstDocs)
dict2 = corpora.Dictionary(moreDocs)
dict1.merge_with(dict2)
According to the docs, this will map "same tokens to the same ids and new tokens to new ids".
Upvotes: 7