Reputation: 918
I am running gensim's Doc2Vec on Ubuntu.
Doc2Vec rejects my input with the error
AttributeError: 'list' object has no attribute 'words'
import gensim
from gensim.models import doc2vec as dtv
from nltk.corpus import brown
documents = brown.tagged_sents()
d2vmodel = dtv.Doc2Vec(documents, size=100, window=1, min_count=1, workers=1)
I have already tried the suggestions from this SO question, plus many variations, all with the same result:
documents = [brown.tagged_sents()]
I also tried adding a hash function.
If the corpus is a .txt file I can use
documents = TaggedLineDocument(documents)
but that is often not possible.
Upvotes: 0
Views: 792
Reputation: 54173
Gensim's Doc2Vec requires each document to be an object with a words property that is a list of string tokens, and a tags property that is a list of tags. These tags are usually strings, but expert users with large datasets can save a little memory by using plain ints, starting from 0, instead.
A class TaggedDocument is included that has the right 'shape' and is used in most of the Gensim documentation and tutorial examples, but given Python's 'duck typing', any object with words and tags properties will do.
But a plain list won't.
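To make the duck-typing point concrete, here is a minimal sketch that uses only the standard library. The `Doc` type and the `looks_like_document` helper are hypothetical names invented for illustration; they are not part of gensim. The check only mirrors the attribute access that fails in the question's traceback.

```python
from collections import namedtuple

# Hypothetical stand-in with the same two fields that Doc2Vec reads.
Doc = namedtuple("Doc", ["words", "tags"])

good = Doc(words=["the", "jury", "said"], tags=["sent_0"])
bad = ["the", "jury", "said"]  # a plain list, as in the question


def looks_like_document(obj):
    """Return True if obj exposes both a words and a tags attribute."""
    return hasattr(obj, "words") and hasattr(obj, "tags")


print(looks_like_document(good))
print(looks_like_document(bad))
```

A plain list fails this check for the same reason it triggers the AttributeError: it has no words attribute.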
And if I understand correctly, brown.tagged_sents() will return lists of (word, part-of-speech-tag) tuples, which isn't even the kind of list of word tokens that would work as a words value, and doesn't supply any of the full-document tags that Doc2Vec needs as keys to the doc-vectors that get trained.
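One way to bridge that gap is sketched below, under some assumptions. The helper name `to_tagged_documents` and the two sample sentences are made up for illustration; the sample only imitates the (word, part-of-speech-tag) shape described above. The snippet falls back to an equivalent namedtuple when gensim is not installed, so that it runs on its own.

```python
from collections import namedtuple

try:
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Fallback with the same two fields, only so the sketch runs without gensim.
    TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])


def to_tagged_documents(tagged_sents):
    """Drop the POS tags and give each sentence a plain-int document tag."""
    return [
        TaggedDocument(words=[word for word, _pos in sent], tags=[i])
        for i, sent in enumerate(tagged_sents)
    ]


# Hand-written sample in the (word, POS-tag) shape; not real corpus output.
sample = [
    [("The", "AT"), ("jury", "NN"), ("said", "VBD")],
    [("It", "PPS"), ("was", "BEDZ"), ("late", "JJ")],
]

documents = to_tagged_documents(sample)
print(documents[0].words, documents[0].tags)
```

With the Brown corpus downloaded, passing brown.tagged_sents() in place of sample should work the same way, and the resulting list is what would then be handed to Doc2Vec. Note that recent gensim releases name the dimensionality parameter vector_size rather than size, so the constructor call in the question may need adjusting depending on the installed version.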
Separately: it is unlikely you'd want to use min_count=1. Discarding very-low-frequency words usually makes the retained Word2Vec/Doc2Vec vectors better.
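As a rough illustration of what a frequency cutoff does to the vocabulary, the sketch below counts tokens and keeps only those that occur at least min_count times. It is not gensim's implementation; `surviving_vocab` and the toy corpus are hypothetical, and the assumption is simply that words below the threshold are ignored.

```python
from collections import Counter


def surviving_vocab(token_lists, min_count):
    """Return the set of tokens whose total frequency is at least min_count."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return {tok for tok, n in counts.items() if n >= min_count}


# Toy corpus invented for this example.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "aardvark", "slept"],
]

print(sorted(surviving_vocab(corpus, min_count=1)))
print(sorted(surviving_vocab(corpus, min_count=2)))
```

With min_count=1 every word is kept, including words seen only once, which give the model very little to learn from.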
Upvotes: 1