user8566323

How to obtain document vectors in doc2vec in gensim

I know how to obtain a document vector for a given tag in doc2vec using print(model.docvecs['recipe__11']).

My documents are recipes (tags start with recipe__), newspapers (tags start with news__), or ingredients (tags start with ingre__).

Now I want to retrieve all the document vectors of recipes. The tags of my recipe documents follow the pattern recipe__<some number> (e.g., recipe__23, recipe__34). Is it possible to obtain multiple document vectors using a pattern (e.g., all tags starting with recipe__)?

Please help me!

Upvotes: 2

Views: 2330

Answers (1)

gojomo

Reputation: 54173

There's no pattern-retrieval, but you can access the list of all known (string) doc-tags in model.docvecs.offset2doctag. You could then loop over that list to find all matches, and retrieve each individually.
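The loop described above could look roughly like the sketch below. It relies only on the offset2doctag list and per-tag lookup mentioned in this answer; the helper name vectors_with_prefix and the FakeDocvecs stand-in are hypothetical and exist only so the example runs without a trained model.

```python
def vectors_with_prefix(docvecs, prefix):
    """Return {tag: vector} for every string doc-tag starting with `prefix`."""
    return {
        tag: docvecs[tag]  # same per-tag lookup as model.docvecs['recipe__11']
        for tag in docvecs.offset2doctag
        if tag.startswith(prefix)
    }


class FakeDocvecs:
    """Hypothetical stand-in for model.docvecs, only to make the sketch runnable."""

    def __init__(self, vectors_by_tag):
        self._vectors = dict(vectors_by_tag)
        self.offset2doctag = list(vectors_by_tag)  # tags in insertion order

    def __getitem__(self, tag):
        return self._vectors[tag]


# Made-up two-dimensional vectors, purely for illustration.
demo = FakeDocvecs({
    "recipe__11": [0.1, 0.2],
    "news__3": [0.3, 0.4],
    "recipe__23": [0.5, 0.6],
    "ingre__7": [0.7, 0.8],
})

recipes_by_tag = vectors_with_prefix(demo, "recipe__")
print(sorted(recipes_by_tag))
```

With a real model you would pass model.docvecs in place of demo. Keeping the result as a dict preserves the link between each tag and its vector.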

Also, all the doc-vectors are in a large array, model.docvecs.doctag_syn0. If you've used exclusively string doc-tags, then the position of a tag in offset2doctag will be exactly the index of the corresponding vector in doctag_syn0. That allows you to use numpy 'mask indexing' to grab a subset of vectors as a new array, like:

recipes_mask = [tag.startswith('recipe__') for tag in model.docvecs.offset2doctag]
recipes_vectors = model.docvecs.doctag_syn0[recipes_mask]

Of course, this array-of-vectors no longer has the recipes in the same positions as the original, so you'd need extra steps to know where (for example) the 'recipe__11' vector is in recipes_vectors.
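One possible form of those extra steps is sketched below. Mask indexing keeps the selected rows in their original relative order, so the matching tags can be numbered in the order they appear in offset2doctag. The helper name rows_in_subset and the sample tag list are hypothetical, and the sketch assumes exclusively string doc-tags.

```python
def rows_in_subset(offset2doctag, prefix):
    """Map each matching tag to its row in the mask-selected subset array."""
    # Same filter as the mask, so the order matches the subset's rows.
    matching = [tag for tag in offset2doctag if tag.startswith(prefix)]
    return {tag: row for row, tag in enumerate(matching)}


# Made-up tag list standing in for model.docvecs.offset2doctag.
tags = ["recipe__11", "news__3", "recipe__23", "ingre__7", "recipe__34"]

recipe_rows = rows_in_subset(tags, "recipe__")
print(recipe_rows["recipe__23"])  # prints 1
```

With a real model, recipes_vectors[recipe_rows['recipe__11']] would then give the vector for 'recipe__11', provided recipe_rows was built from model.docvecs.offset2doctag using the same prefix as the mask.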

Upvotes: 6
