Reputation: 615
I want to load a previously trained word2vec model into gensim. The trouble is the file format: it is not a .bin file but a .tar.gz file. It is the file deu-ch_web-public_2019_1M.tar.gz from the University of Leipzig. The model is also listed on HuggingFace, alongside various word2vec models for English and German.
First I tried:
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('deu-ch_web-public_2019_1M.tar.gz')
--> ValueError: invalid literal for int() with base 10: 'deu-ch_web-public_2019_1M
Then I unzipped the file with 7-Zip and tried the following:
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('deu-ch_web-public_2019_1M.tar')
--> ValueError: invalid literal for int() with base 10: 'deu-ch_web-public_2019_1M
from gensim.models import word2vec
model = word2vec.Word2Vec.load('deu-ch_web-public_2019_1M.tar')
--> UnpicklingError: could not find MARK
Then I got a bit desperate...
import gensim.downloader
model = gensim.downloader.load('deu-ch_web-public_2019_1M.tar')
--> ValueError: Incorrect model/corpus name
Googling around, I found useful information on how to load a .bin model with gensim (see here and here). Following this thread, it seems tricky to load a .tar file with gensim, especially when there is not one .txt file but five .txt files, as in this case. I found one answer on how to read a .tar file, but it uses TensorFlow. Since I am not familiar with TensorFlow, I would prefer to stick with gensim. Any thoughts on how to solve this are appreciated.
Upvotes: 0
Views: 206
Reputation: 615
Following gojomo's comment, I decided to use DevMount's and spaCy's pre-trained German models to find synonyms. First the models need to be downloaded and unzipped (see the posts above). DevMount's model can be downloaded here. spaCy has three pre-trained German models; they can be downloaded here. In the spirit of sharing, below is the code I implemented.
DevMount's Model
import gensim
from gensim.models import KeyedVectors
# load the pre-trained word2vec model
model = KeyedVectors.load_word2vec_format('german.model', binary=True)
# get the words most similar to Vertrauen
model.most_similar(model['Vertrauen'])
# get the words most similar to Vertrauen after subtracting the vector for Mutter
model.most_similar(model['Vertrauen'] - model['Mutter'])
# get the embedding vector for Vertrauen
model['Vertrauen']
spaCy
import numpy as np
import spacy
model = spacy.load('de_core_news_md')
keyWord = 'Vertrauen'
# find the 10 nearest neighbours of the keyword's vector
ms = model.vocab.vectors.most_similar(
    np.asarray([model.vocab.vectors[model.vocab.strings[keyWord]]]), n=10)
# ms is a tuple of (keys, best rows, scores)
words = [model.vocab.strings[w] for w in ms[0][0]]
distances = ms[2]
print(words)
--> The models de_core_news_sm and de_core_news_lg threw unpickling and/or key errors in my case, so I decided to use de_core_news_md.
I also discovered LeoLM from LAION. It is a German foundation language model built on Llama-2, i.e. an open-source large language model (LLM). The model can be downloaded, and it also has an online interface on HuggingFace. As an LLM it can, amongst other things, be queried for synonyms.
Upvotes: 1
Reputation: 54173
A .tar file is a bundle of one or more directories and files – see https://en.wikipedia.org/wiki/Tar_(computing) – and thus not the sort of single-model file that you should expect Gensim to open directly.
Rather, similar to a .zip file, you'd use some purpose-specific software to extract the content inside the .tar into individual files – then point Gensim at those, individually, if they're in formats Gensim understands.
A typical command-line operation to extract the individual file(s) from a .tar.gz
file (which is both tarred & gzipped) would be:
tar -xvzf deu-ch_web-public_2019_1M.tar.gz
That tells the command to extract (x) with verbose (v) reporting while also un-gzipping (z) the file (f) deu-ch_web-public_2019_1M.tar.gz. Then you'll have one or more new local files, which are the actual (not-packaged-up) files of interest.
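If you'd rather stay in Python, the same extraction can be sketched with the standard-library tarfile module (the archive name shown is the one from the question):

```python
import tarfile

def extract_tar_gz(archive_path, dest="."):
    """Extract a .tar.gz archive and return the names of its members."""
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()     # list the bundled files before extracting
        tar.extractall(path=dest)  # equivalent of `tar -xvzf`
    return names

# e.g. extract_tar_gz("deu-ch_web-public_2019_1M.tar.gz")
```

Printing the returned names first is a cheap way to see what is actually inside the bundle before pointing Gensim at anything.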
In some graphical UI file-explorers, like the macOS Finder, simply double-clicking to perform the default 'open' action on deu-ch_web-public_2019_1M.tar.gz will perform this expansion (no tar command line needed).
But note: the University of Leipzig page you've linked describes these files as 'corpora' (training texts), not trained sets of word-vectors or word2vec models.
And I looked at the "2019 - switzerland - public web" file you're referring to, and inside is a directory (folder) deu-ch_web-public_2019_1M, with 7 .txt files of various formats and 1 .sql file inside. But none of those are any sort of trained word-vectors – just text & text-statistics.
You could use those to train a model yourself. The deu-ch_web-public_2019_1M-sentences.txt is closest to what you need, as 1 million plain-text sentences.
But it's still not yet in a form fully ready for word2vec training. Each line has a redundant line-number at the front, and the text hasn't yet been tokenized into word-tokens (which would potentially remove punctuation, or sometimes keep punctuation as distinct tokens). And, as a mere 15 million words total, it's still fairly small as a corpus for creating a powerful word2vec model.
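To make that concrete, here's a minimal preprocessing sketch. It assumes the Leipzig -sentences.txt layout of a line number, a tab, then the sentence, and uses a crude regex tokenizer (lowercasing and stripping punctuation); the training call itself is shown commented since it needs the full corpus:

```python
import re

def read_sentences(path):
    """Yield one list of word-tokens per corpus line, dropping line numbers."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            text = line.split("\t", 1)[-1]          # drop the leading number
            yield re.findall(r"\w+", text.lower())  # crude word tokenizer

# Gensim needs a restartable iterable (it iterates once per epoch),
# so materialize the generator into a list before training:
# from gensim.models import Word2Vec
# sentences = list(read_sentences("deu-ch_web-public_2019_1M-sentences.txt"))
# model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)
# model.wv.most_similar("vertrauen")
```

The tokenizer here is deliberately simplistic; for serious use you'd substitute a proper German tokenizer and reconsider the lowercasing, since German nouns are capitalized.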
Upvotes: 0
Reputation: 5015
There are some ways to load a gensim model.
First, extract the content of the compressed .tar file.
Then, if the content is a .txt file:
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('/home/user/file.txt')
If the file is a .model file:
from gensim.models import Word2Vec
content = "/home/user/file.word2vec.model"
model = Word2Vec.load(content)
Upvotes: 0