Reputation: 1041
I used the MySentences class to extract sentences from all the files in a directory, and used these sentences to train a word2vec model.
My dataset is unlabeled.
import os
import gensim

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        # Yield each line of every file in the directory as a list of tokens.
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.split()

sentences = MySentences('sentences')
model = gensim.models.Word2Vec(sentences)
Now I want to use that class to build a doc2vec model. I read the Doc2Vec reference page. Doc2Vec() takes sentences as a parameter, but it doesn't accept the sentences variable above and raises this error:
AttributeError: 'list' object has no attribute 'words'
What is the problem? What is the correct type of that parameter?
Update:
I think the unlabeled data is the problem. It seems doc2vec needs labeled data.
Upvotes: 1
Views: 1126
Reputation: 3697
Unlike word2vec, doc2vec needs every training entry to be tagged with a unique id. This is needed because when it later predicts similarities, its results are doc ids (the unique ids of the training entries), just as words are the predictions for word2vec.
Here is a piece of my code that does exactly what you want to achieve:
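To make the tagging requirement concrete for the directory case from the question, here is a minimal sketch. It assumes that gensim's TaggedDocument is a namedtuple with the fields words and tags, and uses a stand-in with the same fields so that it runs without gensim. The class name TaggedSentences and the "filename_linenumber" tag scheme are hypothetical choices for illustration, not part of the library.

```python
import os
import tempfile
from collections import namedtuple

# Stand-in for gensim.models.doc2vec.TaggedDocument (assumed to be a
# namedtuple with the fields `words` and `tags`).
TaggedDocument = namedtuple("TaggedDocument", "words tags")


class TaggedSentences(object):
    """Hypothetical variant of MySentences that tags every line."""

    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in sorted(os.listdir(self.dirname)):
            path = os.path.join(self.dirname, fname)
            with open(path) as fp:
                for line_no, line in enumerate(fp):
                    # Tag = file name + line number, unique for each line.
                    tag = "%s_%d" % (fname, line_no)
                    yield TaggedDocument(line.split(), [tag])


# A plain token list has no `.words` attribute, which is what the
# AttributeError in the question reports.
plain_has_words = hasattr(["some", "tokens"], "words")  # False

# Small demo on a temporary directory holding a single file.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "a.txt"), "w") as fp:
    fp.write("first sentence here\nsecond one\n")

docs = list(TaggedSentences(demo_dir))
```

With gensim installed, importing the real TaggedDocument in place of the stand-in should let an instance such as TaggedSentences('sentences') be passed to Doc2Vec.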
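from gensim.models.doc2vec import TaggedDocument

class DynamicCorpus(object):
    def __iter__(self):
        # csv_file: path to the id:text file, defined elsewhere
        with open(csv_file) as fp:
            for line in fp:
                # Split on the first ':' only, so colons inside the text are kept.
                doc_id, text = line.rstrip('\n').split(':', 1)
                yield TaggedDocument(text.split(), [doc_id])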
My csv file has the format id:text.
Later you can just feed the corpus to the model:
from gensim.models import doc2vec

corpus = DynamicCorpus()
d2v = doc2vec.Doc2Vec(min_count=15,
                      window=10,
                      vector_size=300,
                      workers=15,
                      alpha=0.025,
                      min_alpha=0.00025,
                      dm=1)
d2v.build_vocab(corpus)
# training_iterations: number of outer passes, defined elsewhere
for epoch in range(training_iterations):
    # d2v.epochs replaces the older d2v.iter attribute
    d2v.train(corpus, total_examples=d2v.corpus_count, epochs=d2v.epochs)
    d2v.alpha -= 0.0002
    d2v.min_alpha = d2v.alpha
Upvotes: 0
Reputation: 1041
There is no need for extra classes to solve the problem. Newer versions of the library added a TaggedLineDocument class, which reads a file and turns each line into a tagged document.
from gensim.models.doc2vec import Doc2Vec, TaggedLineDocument
sentences = TaggedLineDocument(INPUT_FILE)
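As a rough illustration of what this gives you, the sketch below mimics the behaviour with the standard library only: one document per line, split on whitespace, tagged with its zero-based line number. The function name tagged_lines and the namedtuple stand-in are hypothetical. This is my understanding of how TaggedLineDocument behaves, not a copy of its implementation.

```python
import os
import tempfile
from collections import namedtuple

# Stand-in with the same fields as gensim's TaggedDocument (assumption).
TaggedDocument = namedtuple("TaggedDocument", "words tags")


def tagged_lines(path):
    """Yield one tagged document per line, tagged with its line number."""
    with open(path) as fp:
        for line_no, line in enumerate(fp):
            yield TaggedDocument(line.split(), [line_no])


# Demo on a temporary two-line file.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as fp:
    fp.write("the quick brown fox\njumps over the lazy dog\n")

line_docs = list(tagged_lines(demo_path))
os.remove(demo_path)
```

Because the tags are just line numbers, the input file needs one sentence or document per line, and you look results up by line position afterwards.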
Then train the model:
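model = Doc2Vec(alpha=0.025, min_alpha=0.025)
model.build_vocab(sentences)
for epoch in range(10):
    # Recent gensim versions require total_examples and epochs here.
    model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)
    model.alpha -= 0.002
    model.min_alpha = model.alpha
    print(epoch)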
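Since min_alpha is set equal to alpha before every pass, the learning rate stays fixed within a pass and only drops between passes. The small sketch below replays that arithmetic without any training, using the same start value (0.025), step (0.002) and number of passes (10) as the loop above. The variable names are only for illustration.

```python
# Replay the learning-rate bookkeeping of the training loop, no gensim needed.
alpha = 0.025
step = 0.002
schedule = []
for epoch in range(10):
    schedule.append(round(alpha, 3))  # rate used during this pass
    alpha -= step

# schedule starts at 0.025 (first pass) and ends at 0.007 (last pass)
```

So the last pass still runs at a rate of about 0.007, and the rate would reach zero only after a few more passes at this step size.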
Upvotes: 2