Pedram

Reputation: 2611

Gensim Doc2Vec model only generates a limited number of vectors

I am using the gensim Doc2Vec model to generate my feature vectors. Here is the code I am using (the problem is described in the code comments):

import multiprocessing
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

cores = multiprocessing.cpu_count()

# creating a list of tagged documents
training_docs = []

# all_docs: a list of 53 strings which are my documents and are very long (not just a couple of sentences)
for index, doc in enumerate(all_docs):
    # 'doc' is in unicode format and I have already preprocessed it
    training_docs.append(TaggedDocument(doc.split(), str(index+1)))

# at this point, I have 53 TaggedDocument objects in my 'training_docs' list

model = Doc2Vec(training_docs, size=400, window=8, min_count=1, workers=cores)

# when I print the number of vectors, I get only 10, but I expected 53 (one for each document in my training_docs list)
print(len(model.docvecs))
# output: 10

Am I making a mistake, or is there another parameter that I should set?

UPDATE: I experimented with the tags parameter of TaggedDocument. When I changed the tags to a mixture of text and numbers (Doc1, Doc2, ...), the number of generated vectors changed, but it still does not match the number of documents.

Upvotes: 0

Views: 316

Answers (1)

gojomo

Reputation: 54153

Look at the actual tags it has discovered in your corpus:

print(model.docvecs.offset2doctag)

Do you see a pattern?

The tags property of each document should be a list of tags, not a single tag. If you supply a simple string-of-an-integer, it will see it as a list-of-digits, and thus only learn the tags '0', '1', ..., '9'.
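To see why exactly 10 tags come out, here is a small pure-Python sketch (no gensim needed). The helper `distinct_tags` is a hypothetical stand-in that only mimics looping over each document's tags; it is not gensim's actual code. The `Doc1`, `Doc2`, ... tags are assumed to be built as `'Doc' + str(index + 1)`, following the UPDATE in the question.

```python
def distinct_tags(tag_fields):
    """Collect distinct tags by iterating over each document's tags value."""
    seen = []
    for tags in tag_fields:
        # Iterating a str yields its characters; iterating a list yields its items.
        for tag in tags:
            if tag not in seen:
                seen.append(tag)
    return seen

# Tags as in the question: one bare string per document.
string_tags = [str(index + 1) for index in range(53)]
# Tags as in the UPDATE (assumed format): 'Doc1', 'Doc2', ...
mixed_tags = ["Doc" + str(index + 1) for index in range(53)]
# Tags wrapped in a list, one tag per document.
list_tags = [[str(index + 1)] for index in range(53)]

print(len(distinct_tags(string_tags)))  # 10: the digits '0' to '9'
print(len(distinct_tags(mixed_tags)))   # 13: 'D', 'o', 'c' plus the ten digits
print(len(distinct_tags(list_tags)))    # 53: one tag per document
```

This also accounts for the different count seen after switching to tags like Doc1, Doc2, ...: the letters become extra one-character tags.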

You could replace str(index+1) with [str(index+1)] and get the behavior you were expecting.

But, since your document IDs are just ascending integers, you can also just use plain Python ints as your doctags. This will save some memory, by avoiding the creation of a lookup dict from string-tag to array-slot (int). To do this, replace the str(index+1) with [index]. (This starts the doc-IDs from 0, which is a teensy bit more Pythonic, and also avoids wasting an unused 0 position in the raw array that holds the trained vectors.)
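As a rough sketch of the two fixes side by side: the `TaggedDocument` below is a namedtuple stand-in with the same two fields (words, tags), used only so the snippet runs without gensim installed, and `all_docs` is a hypothetical toy corpus in place of the 53 real documents.

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument (fields: words, tags); an assumption
# made so this sketch is runnable on its own.
TaggedDocument = namedtuple("TaggedDocument", "words tags")

# Hypothetical toy corpus standing in for the real documents.
all_docs = ["first toy document", "second toy document", "third toy document"]

# Option 1: keep string tags, but wrap each one in a list.
docs_str_tags = [
    TaggedDocument(doc.split(), [str(index + 1)])
    for index, doc in enumerate(all_docs)
]

# Option 2: use plain ints as tags, starting from 0.
docs_int_tags = [
    TaggedDocument(doc.split(), [index])
    for index, doc in enumerate(all_docs)
]

print(docs_str_tags[0].tags)  # ['1']
print(docs_int_tags[0].tags)  # [0]
```

With gensim installed, drop the stand-in and use `from gensim.models.doc2vec import TaggedDocument` instead; either list can then be passed to Doc2Vec as in the question.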

Upvotes: 1
