Simone

Reputation: 615

Load word2vec model that is in .tar format

I want to load a previously trained word2vec model into gensim. The trouble is the file format: it is not a .bin file but a .tar file. It is the file deu-ch_web-public_2019_1M.tar.gz from the University of Leipzig. The model is also listed on HuggingFace, where different word2vec models for English and German are listed.

First I tried:

from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('deu-ch_web-public_2019_1M.tar.gz')

--> ValueError: invalid literal for int() with base 10: 'deu-ch_web-public_2019_1M

Then I unzipped the file with 7-Zip and tried the following:

from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('deu-ch_web-public_2019_1M.tar')

--> ValueError: invalid literal for int() with base 10: 'deu-ch_web-public_2019_1M

from gensim.models import word2vec
model = word2vec.Word2Vec.load('deu-ch_web-public_2019_1M.tar')

--> UnpicklingError: could not find MARK

Then I got a bit desperate...

import gensim.downloader
model = gensim.downloader.load('deu-ch_web-public_2019_1M.tar')

--> ValueError: Incorrect model/corpus name

Googling around, I found useful information on how to load a .bin model with gensim (see here and here). Following this thread, it seems tricky to load a .tar file with gensim, especially when there is not one .txt file but five .txt files, as in this case. I found one answer on how to read a .tar file, but with tensorflow. Since I am not familiar with tensorflow, I prefer to use gensim. Any thoughts on how to solve the issue are appreciated.

Upvotes: 0

Views: 206

Answers (3)

Simone

Reputation: 615

Following gojomo's comment, I decided to use DevMount's and spaCy's pre-trained German models to find synonyms. First the models need to be downloaded and unzipped; see the posts above. DevMount's model can be downloaded here. spaCy has three pre-trained German models; they can be downloaded here. In the spirit of sharing, below is the code I implemented.

DevMount's Model

import gensim
from gensim.models import KeyedVectors

# Load DevMount's pre-trained Word2Vec model (binary word2vec format).
model = KeyedVectors.load_word2vec_format('german.model', binary=True)
# get the words most similar to 'Vertrauen'
model.most_similar(model['Vertrauen'])
# get the words most similar to 'Vertrauen' minus the vector for 'Mutter'
model.most_similar(model['Vertrauen'] - model['Mutter'])
# get the embedding vector for 'Vertrauen'
model['Vertrauen']
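
One thing worth noting: passing the raw vector model['Vertrauen'] into most_similar returns 'Vertrauen' itself as the top hit, since its own vector is the closest match. A small sketch of the simpler call, which excludes the query word from the results (assuming the same model object as above):

# passing the word itself (not its vector) leaves 'Vertrauen' out of the hits
model.most_similar('Vertrauen', topn=10)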

spaCy

import numpy as np
import spacy

# medium German pipeline, which ships with word vectors
model = spacy.load('de_core_news_md')

keyWord = 'Vertrauen'

# look up the 10 vectors closest to the keyword's vector
ms = model.vocab.vectors.most_similar(
    np.asarray([model.vocab.vectors[model.vocab.strings[keyWord]]]), n=10)
# map the returned vector keys back to strings
words = [model.vocab.strings[w] for w in ms[0][0]]
distances = ms[2]  # similarity scores for the returned words
print(words)

--> The models de_core_news_sm and de_core_news_lg threw unpickling and/or key errors in my case, so I decided to use de_core_news_md.

I also discovered LeoLM from LAION. It is a German foundation language model built on Llama-2, i.e. an open-source large language model (LLM). The model can be downloaded, but there is also an online interface on HuggingFace. As an LLM it can, amongst other things, be queried for synonyms.
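
For completeness, a minimal sketch of querying such a model for synonyms with the transformers library. The model id 'LeoLM/leo-hessianai-7b-chat' is my assumption of the HuggingFace name, and the model is large, so check the model card for the exact id and hardware requirements before running:

from transformers import pipeline

# assumed model id -- verify on HuggingFace first; the weights are several GB
generator = pipeline('text-generation', model='LeoLM/leo-hessianai-7b-chat')
prompt = 'Nenne fünf Synonyme für das Wort "Vertrauen".'  # "Name five synonyms for the word 'Vertrauen'."
print(generator(prompt, max_new_tokens=50)[0]['generated_text'])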

Upvotes: 1

gojomo

Reputation: 54173

A .tar file is a bundle of one or more directories and files – see https://en.wikipedia.org/wiki/Tar_(computing) – and thus not the sort of single-model file that you should expect Gensim to open directly.

Rather, much as with a .zip file, you'd use some purpose-specific software to extract the content inside the .tar into individual files – then point Gensim at those, individually, if they're in formats Gensim understands.

A typical command-line operation to extract the individual file(s) from a .tar.gz file (which is both tarred & gzipped) would be:

tar -xvzf deu-ch_web-public_2019_1M.tar.gz

That tells the command to extract with verbose reporting while also un-gzipping the file deu-ch_web-public_2019_1M.tar.gz. Then you'll have one or more new local files, which are the actual (not-packaged-up) files of interest.

In some graphical UI file-explorers, like the MacOS 'Finder', simply double-clicking to perform the default 'open' action on deu-ch_web-public_2019_1M.tar.gz will perform this expansion (no tar command-line needed).
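
Alternatively, the same extraction can be done from Python itself with the standard-library tarfile module – for example:

import tarfile

# open the gzipped tar archive & unpack its contents into the current directory
with tarfile.open('deu-ch_web-public_2019_1M.tar.gz', 'r:gz') as archive:
    archive.extractall('.')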

But note: the University of Leipzig page you've linked describes these files as 'corpora' (training texts), not trained sets of word-vectors or word2vec models.

And I looked at the "2019 - switzerland - public web file" you're referring to, and inside is a directory (folder) deu-ch_web-public_2019_1M, with 7 .txt files inside of various formats, and 1 .sql file. But none of those are any sort of trained word-vectors – just text & text-statistics.

You could use those to train a model yourself. The deu-ch_web-public_2019_1M-sentences.txt is closest to what you need, as 1 million plain-text sentences.

But it's still not yet in a form fully ready for word2vec training. Each line has a redundant line-number at the front, and the text hasn't yet been tokenized into word-tokens (which would potentially remove punctuation, or sometimes keep punctuation as distinct tokens). And, as a mere 15 million words total, it's still fairly small as a corpus for creating a powerful word2vec model.
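
For what it's worth, here's a minimal sketch of how that training could look with Gensim, assuming each line of the -sentences.txt file is a line number, a tab, and the sentence text (check the actual layout of the extracted file first):

from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

class LeipzigSentences:
    """Stream tokenized sentences from a Leipzig '-sentences.txt' file."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding='utf-8') as f:
            for line in f:
                # drop the leading line-number column, keep the sentence text
                _, _, sentence = line.partition('\t')
                # crude tokenization: lowercases and strips punctuation
                yield simple_preprocess(sentence)

corpus = LeipzigSentences('deu-ch_web-public_2019_1M-sentences.txt')
model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=5, workers=4)

# save in the binary word2vec format that KeyedVectors.load_word2vec_format() can read back
model.wv.save_word2vec_format('deu-ch_web-public_2019_1M.bin', binary=True)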

Upvotes: 0

razimbres

Reputation: 5015

There are a few ways to load a gensim model.

First, extract the content of the compressed .tar file.

Then, if the content is a .txt file:

from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format('/home/user/file.txt')

If the file is a .model file:

from gensim.models import Word2Vec

content = "/home/user/file.word2vec.model"

model = Word2Vec.load(content)
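
And if the extracted file turns out to be a binary word2vec file (a .bin, as mentioned in the question), the binary flag is needed – for example:

from gensim.models import KeyedVectors

# binary=True tells gensim to parse the binary word2vec format rather than plain text
model = KeyedVectors.load_word2vec_format('/home/user/file.bin', binary=True)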

Upvotes: 0
