Green
Green

Reputation: 685

Import GoogleNews-vectors-negative300.bin

I am working on code using the gensim and having a tough time troubleshooting a ValueError within my code. I finally was able to zip GoogleNews-vectors-negative300.bin.gz file so I could implement it in my model. I also tried gzip which the results were unsuccessful. The error in the code occurs in the last line. I would like to know what can be done to fix the error. Is there any workarounds? Finally, is there a website that I could reference?

Thank you respectfully for your assistance!

import gensim
from keras import backend
from keras.layers import Dense, Input, Lambda, LSTM, TimeDistributed
from keras.layers.merge import concatenate
from keras.layers.embeddings import Embedding
from keras.models import Mode

pretrained_embeddings_path = "GoogleNews-vectors-negative300.bin"
word2vec = 
gensim.models.KeyedVectors.load_word2vec_format(pretrained_embeddings_path, 
binary=True)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-23bd96c1d6ab> in <module>()
  1 pretrained_embeddings_path = "GoogleNews-vectors-negative300.bin"
----> 2 word2vec = 
gensim.models.KeyedVectors.load_word2vec_format(pretrained_embeddings_path, 
binary=True)

C:\Users\green\Anaconda3\envs\py35\lib\site-
packages\gensim\models\keyedvectors.py in load_word2vec_format(cls, fname, 
fvocab, binary, encoding, unicode_errors, limit, datatype)
244                             word.append(ch)
245                     word = utils.to_unicode(b''.join(word), 
encoding=encoding, errors=unicode_errors)
--> 246                     weights = fromstring(fin.read(binary_len), 
dtype=REAL)
247                     add_word(word, weights)
248             else:

ValueError: string size must be a multiple of element size

Upvotes: 29

Views: 83455

Answers (6)

Maciej Skorski
Maciej Skorski

Reputation: 3354

Also available from figshare:

wget https://figshare.com/ndownloader/files/10798046 -O GoogleNews-vectors-negative300.bin

Upvotes: 2

ohsoifelse
ohsoifelse

Reputation: 683

Edit: The S3 url has stopped working. You can download the data from Kaggle or use this Google Drive link (be careful downloading files from Google Drive).

The below commands no longer work work.

brew install wget

wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"

This downloads the GZIP compressed file that you can uncompress using:

gzip -d GoogleNews-vectors-negative300.bin.gz

You can then use the below command to get wordVector.

from gensim import models

w = models.KeyedVectors.load_word2vec_format(
    '../GoogleNews-vectors-negative300.bin', binary=True)

Upvotes: 47

Piotr Rusin
Piotr Rusin

Reputation: 11

You can use this URL that points to Google Drive's download of the bin.gz file: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?resourcekey=0-wjGZdNAUop6WykTtMip30g

Alternative mirrors (including the S3 mentioned here) seem to be broken.

Upvotes: 1

Anjana
Anjana

Reputation: 71

Here is what worked for me. I loaded a part of the model and not the entire model as it's huge.

!pip install wget

import wget
url = 'https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz'
filename = wget.download(url)

f_in = gzip.open('GoogleNews-vectors-negative300.bin.gz', 'rb')
f_out = open('GoogleNews-vectors-negative300.bin', 'wb')
f_out.writelines(f_in)

import gensim
from gensim.models import Word2Vec, KeyedVectors
from sklearn.decomposition import PCA

model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True, limit=100000)

Upvotes: 4

hansrajswapnil
hansrajswapnil

Reputation: 639

try this -

import gensim.downloader as api

wv = api.load('word2vec-google-news-300')

vec_king = wv['king']

also, visit this link : https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#sphx-glr-auto-examples-tutorials-run-word2vec-py

Upvotes: 21

user8403237
user8403237

Reputation:

you have to write the complete path.

use this path:

https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz

Upvotes: 17

Related Questions