Euijun Jeong

Reputation: 23

Python 3.4 word2vec.Word2Vec OverflowError

I'm studying word2vec, but when I use gensim's word2vec to train on text data, an OverflowError is raised from NumPy.

The message is:

model.vocab[w].sample_int > model.random.randint(2**32)]
Warning (from warnings module):
  File "C:\Python34\lib\site-packages\gensim\models\word2vec.py", line 636
    warnings.warn("C extension not loaded for Word2Vec, training will be slow. "
UserWarning: C extension not loaded for Word2Vec, training will be slow. Install a C compiler and reinstall gensim for fast training.
Exception in thread Thread-1:
Traceback (most recent call last):
  File "C:\Python34\lib\threading.py", line 920, in _bootstrap_inner
    self.run()
  File "C:\Python34\lib\threading.py", line 868, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Python34\lib\site-packages\gensim\models\word2vec.py", line 675, in worker_loop
    if not worker_one_job(job, init):
  File "C:\Python34\lib\site-packages\gensim\models\word2vec.py", line 666, in worker_one_job
    job_words = self._do_train_job(items, alpha, inits)
  File "C:\Python34\lib\site-packages\gensim\models\word2vec.py", line 623, in _do_train_job
    tally += train_sentence_sg(self, sentence, alpha, work)
  File "C:\Python34\lib\site-packages\gensim\models\word2vec.py", line 112, in train_sentence_sg
    word_vocabs = [model.vocab[w] for w in sentence if w in model.vocab and
  File "C:\Python34\lib\site-packages\gensim\models\word2vec.py", line 113, in <listcomp>
    model.vocab[w].sample_int > model.random.randint(2**32)]
  File "mtrand.pyx", line 935, in mtrand.RandomState.randint (numpy\random\mtrand\mtrand.c:9520)
OverflowError: Python int too large to convert to C long

Can you tell me the cause?

My machine is x64 and the OS is Windows 7, but Python 3.4 is the 32-bit build. NumPy and SciPy are also 32-bit.
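For reference, the integer widths involved can be checked with a short script. This is only a sketch using the standard library; the helper name `describe_int_widths` is made up for illustration and is not part of gensim or NumPy.

```python
import ctypes
import struct


def describe_int_widths():
    """Report the widths that matter for this error (hypothetical helper)."""
    # Pointer size tells you whether the Python build is 32-bit or 64-bit.
    pointer_bits = struct.calcsize("P") * 8
    # Width of the platform's C long, the type named in the error message.
    c_long_bits = ctypes.sizeof(ctypes.c_long) * 8
    # Largest value a signed C long can hold on this platform.
    c_long_max = 2 ** (c_long_bits - 1) - 1
    return {
        "pointer_bits": pointer_bits,
        "c_long_bits": c_long_bits,
        "fits_2_32": 2**32 <= c_long_max,
    }


print(describe_int_widths())
```

If `fits_2_32` is False, the value 2**32 cannot be converted to a C long on that install, which matches the message in the traceback.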

Upvotes: 2

Views: 1537

Answers (1)

user5153326

I get this as well. It looks like gensim has a potential workaround in its development branch:

https://github.com/piskvorky/gensim/commit/726102df66000f2afcea82d95634b055e6521dc8

This doesn't solve the core issue of integer sizes differing across hardware and Python installs, but I think it should resolve the problem with this particular line.

The necessary change is to replace

model.vocab[w].sample_int > model.random.randint(2**32)

with

model.vocab[w].sample_int > model.random.rand() * 2**32

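This avoids the 64-bit / 32-bit integer issue in randint: when the C long that randint converts its argument to is only 32 bits wide, as on your 32-bit Python install, 2**32 does not fit and the OverflowError is raised. rand() returns a float, so no integer conversion is involved.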
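As a minimal standalone sketch of the replacement expression (not gensim's actual code), the comparison can be tried outside of gensim. The helper name `keep_word` and the seed are assumptions for illustration; only `numpy.random.RandomState.rand` is a real API here.

```python
import numpy as np


def keep_word(sample_int, rng):
    """Return True if sample_int exceeds a random threshold (hypothetical helper)."""
    # rand() yields a float in [0, 1), so the threshold is a float in [0, 2**32).
    # No Python int is passed into NumPy, so no C long conversion happens.
    threshold = rng.rand() * 2**32
    return bool(sample_int > threshold)


rng = np.random.RandomState(1)  # arbitrary seed, for repeatability only
print(keep_word(2**31, rng))
```

Note that the threshold is now a float rather than an integer, which makes no practical difference for a greater-than comparison like this one.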

UPDATE: I manually applied that change to my gensim install, and it prevents the error.

Upvotes: 1
