Reputation: 3744
The file GoogleNews-vectors-negative300.bin contains 3 million word vectors, each with 300 dimensions. I think (but am not sure) that this file is loaded when the following line is executed:
from gensim.models.keyedvectors import KeyedVectors
I want to extract the vectors for words that I supply externally in a list called words. This is my code to do this:
import math
import sys
import gensim
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
from gensim.models.keyedvectors import KeyedVectors
words = ['access', 'aeroway', 'airport', 'amenity', 'area', 'atm', 'barrier', 'bay', 'bench', 'boundary', 'bridge', 'building', 'bus', 'cafe', 'car', 'coast', 'continue', 'created', 'defibrillator', 'drinking', 'ele', 'embankment', 'entrance', 'ferry', 'foot', 'fountain', 'fuel', 'gate', 'golf', 'gps', 'grave', 'highway', 'horse', 'hospital', 'house', 'landuse', 'layer', 'leisure', 'man', 'manmade', 'market', 'marketplace', 'maxheight', 'name', 'natural', 'noexit', 'oneway', 'park', 'parking', 'pgs', 'place', 'worship', 'playground', 'police', 'police station', '', 'post', 'post box or mail', 'power', 'powerstation', 'private', 'public', 'railway', 'ref', 'residential', 'restaurant', 'road', 'route', 'school', 'shelter', 'shop', 'source', 'sport', 'toilet', 'toilets', 'tourism', 'unknown', 'vehicle', 'vending', 'vending machine', 'village', 'wall', 'waste', 'water', 'waterway', 'worship'];
model = gensim.models.KeyedVectors.load_word2vec_format(words, binary=True)
M = len(words)
count = 0
for i in range(1,M):
    wi = id2word[words[i]]
    if wi in word2vec.vocab:
        vector[:,count] = model[:,i]
        count = count+1
f = open('word_vectors.csv', 'w')
print(vector, file=f)
f.close()
But when I run the code, it just freezes up my system. Is it because it is loading the whole binary file before searching for the words in words? If yes, how do I get around this issue? I suspect this because I get the following warning, which I suppress with the warnings package:
c:\Python35\lib\site-packages\gensim\utils.py:860: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
And the error it gives is:
Traceback (most recent call last):
File "word2vec.py", line 18, in <module>
model = gensim.models.KeyedVectors.load_word2vec_format(topic, binary=True)
File "c:\Python35\lib\site-packages\gensim\models\keyedvectors.py", line 196, in load_word2vec_format
with utils.smart_open(fname) as fin:
File "c:\Python35\lib\site-packages\smart_open\smart_open_lib.py", line 208, in smart_open
raise TypeError('don\'t know how to handle uri %s' % repr(uri))
TypeError: don't know how to handle uri [['access'], ['aeroway'], ['airport'], ['amenity'], ['area'], ['atm'], ['barrier'], ['bay'], ['bench'], ['boundary'], ['bridge'], ['building'], ['bus'], ['cafe'], ['car'], ['coast'], ['continue'], ['created'], ['defibrillator'], ['drinking'], ['ele'], ['embankment'], ['entrance'], ['ferry'], ['foot'], ['fountain'], ['fuel'], ['gate'], ['golf'], ['gps'], ['grave'], ['highway'], ['horse'], ['hospital'], ['house'], ['landuse'], ['layer'], ['leisure'], ['man'], ['manmade'], ['market'], ['marketplace'], ['maxheight'], ['name'], ['natural'], ['noexit'], ['oneway'], ['park'], ['parking'], ['pgs'], ['place'], ['worship'], ['playground'], ['police'], ['police station'], [''], ['post'], ['post box or mail'], ['power'], ['powerstation'], ['private'], ['public'], ['railway'], ['ref'], ['residential'], ['restaurant'], ['road'], ['route'], ['school'], ['shelter'], ['shop'], ['source'], ['sport'], ['toilet'], ['toilets'], ['tourism'], ['unknown'], ['vehicle'], ['vending'], ['vending machine'], ['village'], ['wall'], ['waste'], ['water'], ['waterway'], ['worship']]
I guess this means that the program is not able to search for the words in the binary file. How do I solve it?
Upvotes: 2
Views: 3998
Reputation: 1208
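The TypeError occurs because load_word2vec_format expects the path to the model file as its first argument, not your list of words. Use the following code to extract a word vector from Google's pretrained word2vec model: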
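import warnings
# suppress gensim's UserWarning (e.g. the Windows chunkize notice) before importing gensim
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
# this import alone doesn't load the trained model
from gensim.models.keyedvectors import KeyedVectors
words = ['access', 'aeroway', 'airport']
# path to the downloaded binary file (adjust to where it is stored on your machine)
path_to_model = 'GoogleNews-vectors-negative300.bin'
# this is how you load the model: the first argument is the file path, not the word list
model = KeyedVectors.load_word2vec_format(path_to_model, binary=True)
# to extract a word vector
print(model[words[0]])  # vector for 'access'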
Result vector:
[ -8.74023438e-02 -1.86523438e-01 .. ]
Your system is freezing because the model is very large and is loaded into memory in full. Try a machine with more memory, or limit how much of the model you load.
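To get a feel for the sizes involved, here is a rough back-of-the-envelope estimate. It assumes the commonly cited figures for this file (about 3 million vectors of 300 dimensions) and 4-byte float32 values; the helper vector_memory_bytes is a hypothetical name used only for illustration, and real memory use will be somewhat higher because of the vocabulary and Python overhead.

```python
def vector_memory_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Rough size of the raw vector matrix only (assumes float32 by default)."""
    return num_vectors * dimensions * bytes_per_value


# Assumed figures: ~3 million vectors, 300 dimensions each.
full = vector_memory_bytes(3_000_000, 300)
# The same matrix when only the first 20,000 vectors are read.
limited = vector_memory_bytes(20_000, 300)

print(f"full model:  {full / 1e9:.1f} GB")
print(f"limit=20000: {limited / 1e6:.1f} MB")
```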
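Limit the model size while loading (limit=20000 reads only the first 20,000 vectors from the file, so words that appear later in the file will be missing):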
model = KeyedVectors.load_word2vec_format(path_to_model, binary=True, limit=20000)
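Putting this together for the original goal of saving vectors for a whole list of words, a minimal sketch might look like the following. The function names load_vectors and write_word_vectors are made up for illustration, and the CSV layout (one row per word: the word, then its vector components) is an assumption rather than anything the question specifies. Words that are not in the loaded vocabulary, such as the empty string or multi-word entries like 'police station', are skipped and returned so you can inspect them.

```python
import csv


def load_vectors(path_to_model, limit=None):
    """Load word2vec-format binary vectors with gensim (imported lazily)."""
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(path_to_model, binary=True, limit=limit)


def write_word_vectors(model, words, out_path):
    """Write one CSV row per in-vocabulary word: the word, then its vector.

    `model` can be any object supporting `word in model` and `model[word]`,
    which gensim's KeyedVectors does. Returns the words that were not found.
    """
    missing = []
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for word in words:
            if word in model:
                writer.writerow([word] + [float(x) for x in model[word]])
            else:
                missing.append(word)
    return missing
```

Usage would be along the lines of model = load_vectors(path_to_model, limit=500000) followed by missing = write_word_vectors(model, words, 'word_vectors.csv'); the limit value here is only an example, and a smaller limit makes it more likely that some of your words are reported as missing.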
Upvotes: 7