Same

Reputation: 759

Load pretrained GloVe vectors in Python

I have downloaded a pretrained GloVe vector file from the internet. It is a .txt file. I am unable to load and access it. It is easy to load and access a word-vector binary file using gensim, but I don't know how to do it when the file is in text format.

Upvotes: 50

Views: 90659

Answers (14)

Jaydeep Pawar

Reputation: 1

import numpy as np

def create_embedding_matrix(word_to_index):
    # word_to_index is a dictionary containing "word: token" pairs
    nb_words = len(word_to_index) + 1

    embeddings_index = {}
    with open('C:/Users/jayde/Desktop/IISc/DLNLP/Assignment1/glove.840B.300d/glove.840B.300d.txt', encoding="utf-8", errors='ignore') as f:
        for line in f:
            values = line.split()
            # the token may itself contain spaces, so everything except
            # the last 300 fields belongs to the word
            word = ''.join(values[:-300])
            coefs = np.asarray(values[-300:], dtype='float32')
            embeddings_index[word] = coefs

    embedding_matrix = np.zeros((nb_words, 300))

    for word, i in word_to_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

    return embedding_matrix

emb_matrix = create_embedding_matrix(vocab_to_int)
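
Such a matrix is typically fed into an embedding layer; a minimal sketch, assuming TensorFlow/Keras (not part of the original answer):

import tensorflow as tf

# freeze the pretrained vectors in a Keras Embedding layer
embedding_layer = tf.keras.layers.Embedding(
    input_dim=emb_matrix.shape[0],
    output_dim=emb_matrix.shape[1],
    embeddings_initializer=tf.keras.initializers.Constant(emb_matrix),
    trainable=False,
)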

Upvotes: 0

Thiago Rainmaker

Reputation: 111

A tool with an easy implementation of GloVe is zeugma:

https://pypi.org/project/zeugma/

from zeugma.embeddings import EmbeddingTransformer
glove = EmbeddingTransformer('glove')

The implementation really is very easy.
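
It exposes a scikit-learn-style transformer, so (per the package's PyPI page) texts can be embedded with transform(); a small sketch, where the example sentences are my own:

from zeugma.embeddings import EmbeddingTransformer

glove = EmbeddingTransformer('glove')
# transform() returns one (averaged) embedding per input text
embeddings = glove.transform(['what is a glove vector', 'loading embeddings'])
print(embeddings.shape)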

Upvotes: 0

jroz

Reputation: 123

Some of the other approaches here required more storage space (e.g. to split files) or were quite slow to run on my personal laptop. I tried a shelf db, but it blew up in storage size. Here's an "in-place" approach with a one-time file-read cost and very low additional storage cost: we treat the original text file as a database and just store each word's position in the file. This works really well when you're, e.g., investigating properties of word vectors.

import pickle
from functools import lru_cache

import numpy as np
from tqdm import tqdm

# First create a map from words to position in the file
def get_db_mapping(fname):
    char_ct = 0    # cumulative position in file
    pos_map = dict()

    with open(fname + ".txt", 'r', encoding='utf-8') as f:
        for line in tqdm(f):
            new_len = len(line)     # len of line

            # get the word
            splitlines = line.split()
            word = splitlines[0].strip()

            # store and increment counter
            pos_map[word] = char_ct
            char_ct += new_len

    # write dict
    with open(fname + '.db', 'wb') as handle:
        pickle.dump(pos_map, handle)


class Embedding:
    """Small wrapper so that we can use [] notation to fetch word vectors.
    It would be better to just have the file pointer and the pos_map as part
    of this class, but that's not how I wrote it initially."""
    def __init__(self, emb_fn):
        self.emb_fn = emb_fn

    def __getitem__(self, item):
        return self.emb_fn(item)


def load_db_mapping(fname, cache_size=1000) -> Embedding:
    """Creates a function closure that wraps access to the db mapping
    and the text file that functions as db. Returns them as an
    Embedding object"""
    # get the two state objects: mapping and file pointer
    with open(fname + '.db', 'rb') as handle:
        pos_map = pickle.load(handle)
    f = open(fname + ".txt", 'r', encoding='utf-8')

    @lru_cache(maxsize=cache_size)
    def get_vector(word: str):
        pos = pos_map[word]
        f.seek(pos, 0)

        # special logic needed because of small count errors:
        # char_ct counted characters, but seek() works in bytes, so
        # multi-byte UTF-8 lines make the offset drift slightly
        fail_ct = 0
        read_word = ""
        while fail_ct < 5 and read_word != word:
            fail_ct += 1
            l = f.readline()
            try:
                splitlines = l.split()
                read_word = splitlines[0].strip()
            except IndexError:
                continue
        if read_word != word:
            raise ValueError('word not found')

        # actually return
        return np.array([float(val) for val in splitlines[1:]])

    return Embedding(get_vector)

# to run
k_glove_vector_name = 'glove.42B.300d'   # omit .txt
get_db_mapping(k_glove_vector_name)      # run only once; creates .db
word_embedding = load_db_mapping(k_glove_vector_name)
word_embedding['hello']

Upvotes: 0

Strayhorn

Reputation: 729

Each corpus needs to start with a line containing the vocab size and the vector size, in that order.

Open the .txt file of the GloVe model and insert a new first line with those two numbers:

For example, for glove.6B.50d.txt, just add 400000 50 as the first line.
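
If you prefer not to edit a large file by hand, the header can be prepended programmatically; a small sketch, assuming glove.6B.50d.txt (the output filename is my own choice):

import shutil

# write the "vocab_size vector_size" header, then copy the original file after it
with open('path/glove.6B.50d.txt', encoding='utf-8') as src, \
     open('path/glove.6B.50d.w2v.txt', 'w', encoding='utf-8') as dst:
    dst.write('400000 50\n')
    shutil.copyfileobj(src, dst)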

Then use gensim to transform that raw .txt vector file to gensim vector format:

import gensim

word_vectors = gensim.models.KeyedVectors.load_word2vec_format('path/glove.6B.50d.txt', binary=False)
word_vectors.save('path/glove_gensim')  # saves in gensim's native format

Upvotes: 0

Karishma Malkan

Reputation: 2109

GloVe model files are in a word - vector format. You can open the text file to verify this. Here is a small snippet of code you can use to load a pretrained GloVe file:

import numpy as np

def load_glove_model(File):
    print("Loading Glove Model")
    glove_model = {}
    with open(File, 'r', encoding='utf-8') as f:
        for line in f:
            split_line = line.split()
            word = split_line[0]
            embedding = np.array(split_line[1:], dtype=np.float64)
            glove_model[word] = embedding
    print(f"{len(glove_model)} words loaded!")
    return glove_model

You can then access the word vectors through the returned dictionary:

glove_model = load_glove_model('glove.6B.50d.txt')
print(glove_model['hello'])

Upvotes: 101

Rudra Desai

Reputation: 1

This code takes some time to store the GloVe embeddings in a shelf, but loading it afterwards is quite fast compared to the other approaches.

import numpy as np
from contextlib import closing
import shelve

def store_glove_to_shelf(glove_file_path, shelf):
    print('Loading Glove')
    with open(glove_file_path, encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vec = np.asarray(values[1:], dtype='float32')
            shelf[word] = vec

shelf_file_name = "glove_embeddings"
glove_file_path = "glove/glove.840B.300d.txt"

# Storing glove embeddings to shelf for faster load
with closing(shelve.open(shelf_file_name + '.shelf', 'c')) as shelf:
    store_glove_to_shelf(glove_file_path,shelf)
    print("Stored glove embeddings from {} to {}".format(glove_file_path,shelf_file_name+'.shelf'))

# To reuse the glove embeddings stored in shelf
with closing(shelve.open(shelf_file_name + '.shelf')) as embeddings_index:
    # USE embeddings_index here , which is a dictionary
    print("Loaded glove embeddings from {}".format(shelf_file_name+'.shelf'))
    print("Found glove embeddings with {} words".format(len(embeddings_index)))

Upvotes: 0

Ursin Brunner

Reputation: 2440

Loading word embeddings from a text file (in my case the glove.42B.300d embeddings) takes quite long (147.2 s on my machine).

What helps is converting the text file first into two new files: a text file that contains only the words (e.g. embeddings.vocab) and a binary file that contains the embedding vectors as a numpy structure (e.g. embeddings.npy).

Once converted, it takes me only 4.96 s to load the same embeddings into memory. This approach ends up with exactly the same dictionary as if you load it from the text file: access time is identical and no additional frameworks are required, but loading is a lot faster.

With this code you convert your embedding text file to the two new files:

import codecs

import numpy as np

def convert_to_binary(embedding_path):
    wv = []

    with codecs.open(embedding_path + ".txt", 'r', encoding='utf-8') as f, \
         codecs.open(embedding_path + ".vocab", "w", encoding='utf-8') as vocab_write:
        for line in f:
            splitlines = line.split()
            vocab_write.write(splitlines[0].strip())
            vocab_write.write("\n")
            wv.append([float(val) for val in splitlines[1:]])

    np.save(embedding_path + ".npy", np.array(wv))

And with this method you load it efficiently into your memory:

def load_word_emb_binary(embedding_file_name_w_o_suffix):
    print("Loading binary word embedding from {0}.vocab and {0}.npy".format(embedding_file_name_w_o_suffix))

    with codecs.open(embedding_file_name_w_o_suffix + '.vocab', 'r', 'utf-8') as f_in:
        index2word = [line.strip() for line in f_in]

    wv = np.load(embedding_file_name_w_o_suffix + '.npy')
    word_embedding_map = {}
    for i, w in enumerate(index2word):
        word_embedding_map[w] = wv[i]

    return word_embedding_map
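
A usage sketch (the conversion only needs to run once; the filename is whatever your embedding files are called):

convert_to_binary('glove.42B.300d')                  # reads glove.42B.300d.txt, run once
embeddings = load_word_emb_binary('glove.42B.300d')  # fast load afterwards
print(embeddings['hello'].shape)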

Disclaimer: This code is shamelessly stolen from https://blog.ekbana.com/loading-glove-pre-trained-word-embedding-model-from-text-file-faster-5d3e8f2b8455. But it might help in this thread.

Upvotes: 5

Ankan Datta

Reputation: 1

import os
import numpy as np

EMBEDDING_DIM = 100  # set to the dimension of the file you downloaded (50/100/200/300)

# store all the pre-trained word vectors
print('Loading word vectors...')
word2vec = {}
# enter the path where you unzipped the glove file
with open(os.path.join('glove/glove.6B.%sd.txt' % EMBEDDING_DIM)) as f:
    # this is just a space-separated text file in the format:
    # word vec[0] vec[1] vec[2] ...
    for line in f:
        values = line.split()
        word = values[0]
        vec = np.asarray(values[1:], dtype='float32')
        word2vec[word] = vec
print('Found %s word vectors.' % len(word2vec))

Upvotes: 0

alabroski

Reputation: 79

Python3 version which also handles bigrams and trigrams:

import numpy as np


def load_glove_model(glove_file):
    print("Loading Glove Model")
    model = {}
    vector_size = 300
    with open(glove_file, 'r', encoding='utf-8') as f:
        for line in f:
            split_line = line.split()
            # everything before the last `vector_size` fields is the token,
            # which may itself contain spaces (bigrams/trigrams)
            word = " ".join(split_line[0:len(split_line) - vector_size])
            embedding = np.array([float(val) for val in split_line[-vector_size:]])
            model[word] = embedding
    print("Done.\n" + str(len(model)) + " words loaded!")
    return model

Upvotes: 3

Indrajith Indraprastham

Reputation: 1348

I found this approach faster.

import pandas as pd

df = pd.read_csv('glove.840B.300d.txt', sep=" ", quoting=3, header=None, index_col=0)
glove = {key: val.values for key, val in df.T.items()}

Save the dictionary:

import pickle
with open('glove.840B.300d.pkl', 'wb') as fp:
    pickle.dump(glove, fp)
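
And to load the dictionary back later (a small sketch):

with open('glove.840B.300d.pkl', 'rb') as fp:
    glove = pickle.load(fp)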

Upvotes: 11

Abhai Kollara

Reputation: 675

Here's a one-liner if all you want is the embedding matrix:

np.loadtxt(path, usecols=range(1, dim+1), comments=None)

where path is path to your downloaded GloVe file and dim is the dimension of the word embedding.

If you want both the words and corresponding vectors you can do

glove = np.loadtxt(path, dtype='str', comments=None)

and separate the words and vectors as follows:

words = glove[:, 0]
vectors = glove[:, 1:].astype('float')
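
If you also want dict-style lookups, the two arrays zip together; a small sketch:

embeddings = dict(zip(words, vectors))
print(embeddings['the'].shape)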

Upvotes: 6

Rania ZYANE

Reputation: 49

import numpy as np

EMBEDDING_FILE = 'path/to/your/glove.txt'

def get_coefs(word, *arr):
    return word, np.asarray(arr, dtype='float32')

embeddings_index = dict(get_coefs(*o.strip().split()) for o in open(EMBEDDING_FILE))

all_embs = np.stack(embeddings_index.values())
emb_mean, emb_std = all_embs.mean(), all_embs.std()
# tokenizer, max_features and embed_size are assumed to come from your
# Keras preprocessing setup
word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))

embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))

for word, i in word_index.items():
    if i >= max_features: continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector

Upvotes: -1

Ben

Reputation: 667

I suggest using gensim to do everything. You can read the file, and you also benefit from the many methods already implemented in this great package.

Suppose you generated GloVe vectors using the C++ program and that your "-save-file" parameter is "vectors". The glove executable will generate two files, "vectors.bin" and "vectors.txt".

Use glove2word2vec to convert GloVe vectors in text format into the word2vec text format:

from gensim.scripts.glove2word2vec import glove2word2vec
glove2word2vec(glove_input_file="vectors.txt", word2vec_output_file="gensim_glove_vectors.txt")

Finally, read the word2vec txt to a gensim model using KeyedVectors:

from gensim.models.keyedvectors import KeyedVectors
glove_model = KeyedVectors.load_word2vec_format("gensim_glove_vectors.txt", binary=False)

Now you can use gensim word2vec methods (for example, similarity) as you'd like.
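
For example, a quick usage sketch:

print(glove_model.most_similar("king", topn=5))
print(glove_model.similarity("king", "queen"))

As an aside, if I read the newer gensim (4.0+) API correctly, the conversion step can be skipped entirely with KeyedVectors.load_word2vec_format("vectors.txt", binary=False, no_header=True).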

Upvotes: 47

Petter

Reputation: 38674

You can do it much faster with pandas:

import pandas as pd
import csv

words = pd.read_table(glove_data_file, sep=" ", index_col=0, header=None, quoting=csv.QUOTE_NONE)

Then to get the vector for a word:

def vec(w):
    return words.loc[w].to_numpy()  # .as_matrix() was removed in newer pandas

And to find the closest word to a vector:

import numpy as np

words_matrix = words.to_numpy()

def find_closest_word(v):
    diff = words_matrix - v
    delta = np.sum(diff * diff, axis=1)
    i = np.argmin(delta)
    return words.iloc[i].name
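
For example, the classic analogy query (a usage sketch):

print(find_closest_word(vec('king') - vec('man') + vec('woman')))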

Upvotes: 55
