Doc2Vec input format

Question

I am running gensim's Doc2Vec on Ubuntu.

Doc2Vec rejects my input with the error:

AttributeError: 'list' object has no attribute 'words'

    import gensim
    from gensim.models import doc2vec as dtv
    from nltk.corpus import brown
    documents = brown.tagged_sents()
    d2vmodel = dtv.Doc2Vec(documents, size=100, window=1, min_count=1, workers=1)

I have already tried the suggestions from this SO question, along with many variations, and get the same result each time. For example, wrapping the corpus in a list:

    documents = [brown.tagged_sents()]

and adding a hash function.

If the corpus is a .txt file, I can use

    documents = TaggedLineDocument(documents)

but that is often not possible.

gojomo · Accepted Answer

Gensim's Doc2Vec requires each document to be in the form of an object with a words property that is a list of string tokens, and a tags property that is a list of tags. These tags are usually strings, but expert users with large datasets can save a little memory by using plain-ints, starting from 0, instead.

A class TaggedDocument is included that is of the right 'shape', and used in most of the Gensim documentation/tutorial examples – but given Python's 'duck typing', any object with words and tags properties will do.

But a plain list won't.
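To make the difference concrete, here is a minimal sketch that needs only the standard library. `Doc` and `has_doc_shape` are hypothetical names invented for illustration, not part of gensim; the point is only that an object with `words` and `tags` has the right shape, while a plain list fails in the same way the question's error message shows.

```python
from collections import namedtuple

# Hypothetical stand-in: any object exposing `words` and `tags` has the right shape.
Doc = namedtuple("Doc", ["words", "tags"])

doc = Doc(words=["the", "jury", "said"], tags=["sent_0"])
plain = ["the", "jury", "said"]


def has_doc_shape(obj):
    """Check for the two attributes that Doc2Vec is described as reading."""
    return hasattr(obj, "words") and hasattr(obj, "tags")


print(has_doc_shape(doc))    # the namedtuple has both attributes
print(has_doc_shape(plain))  # the list has neither

# Accessing .words on a list raises the error from the question.
try:
    plain.words
except AttributeError as err:
    message = str(err)
    print(message)
```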

And if I understand correctly, brown.tagged_sents() returns lists of (word, part-of-speech-tag) tuples. That isn't even the kind of list of word tokens that would work as a words value, and it doesn't supply any of the full-document tags that Doc2Vec needs as keys to the doc-vectors that get trained.
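One possible conversion is sketched below. It assumes each sentence is treated as one document and that its position in the corpus serves as its tag; both are illustrative choices, not something the question specifies. The two hand-written sentences stand in for brown.tagged_sents() so the snippet runs without downloading the corpus, and the fallback namedtuple only mirrors the words/tags shape for machines without gensim. `to_tagged_documents` is a made-up helper name.

```python
from collections import namedtuple

try:
    # The class gensim ships for this purpose.
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Stand-in with the same two fields, so the sketch runs without gensim.
    TaggedDocument = namedtuple("TaggedDocument", "words tags")

# Hand-written sample in the same shape as brown.tagged_sents():
# each sentence is a list of (word, part-of-speech-tag) tuples.
tagged_sents = [
    [("The", "AT"), ("jury", "NN"), ("said", "VBD")],
    [("It", "PPS"), ("was", "BEDZ"), ("late", "JJ")],
]


def to_tagged_documents(sentences):
    """Drop the POS tags and attach one document tag per sentence."""
    docs = []
    for i, sent in enumerate(sentences):
        words = [word for word, _pos in sent]  # keep only the string tokens
        docs.append(TaggedDocument(words=words, tags=[i]))  # plain-int tag from 0
    return docs


documents = to_tagged_documents(tagged_sents)
```

With gensim installed, a list built this way from the full corpus could be passed to Doc2Vec in place of the raw brown.tagged_sents() output, for example `Doc2Vec(documents, vector_size=100, window=1, workers=1)`. Note that gensim 4.0 and later name the parameter `vector_size`, where the question's older call uses `size`.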

Separately: it is unlikely you'd want to use min_count=1. Discarding very-low-frequency words usually makes retained Word2Vec/Doc2Vec vectors better.
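As a rough illustration of what the threshold does: gensim ignores words whose total corpus frequency is below min_count. The sketch below imitates that filtering with a plain Counter; `surviving_vocab` and the toy corpus are invented for this example and are not gensim code.

```python
from collections import Counter


def surviving_vocab(token_lists, min_count):
    """Return the words whose total corpus frequency is at least min_count."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return {word for word, n in counts.items() if n >= min_count}


corpus = [
    ["the", "jury", "said"],
    ["the", "jury", "met"],
    ["the", "court", "said"],
]

vocab_1 = surviving_vocab(corpus, min_count=1)  # every word is kept
vocab_2 = surviving_vocab(corpus, min_count=2)  # words seen only once are dropped

print(sorted(vocab_1))
print(sorted(vocab_2))
```

With min_count=1, words that occur a single time ("met" and "court" here) stay in the vocabulary even though one occurrence gives the model very little to learn from.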
