
Tags: nlp, word-embedding, pre-trained-model, fasttext, bert-language-model

Reputation: 423

Latest Pre-trained Multilingual Word Embedding

Are there any recent pre-trained multilingual word embeddings (where multiple languages are jointly mapped into the same vector space)?

I have looked at the following but they don't fit my needs:

- FastText / MUSE (https://fasttext.cc/docs/en/aligned-vectors.html): this one seems too old, and the word vectors do not use subword / wordpiece information.
- LASER (https://github.com/yannvgn/laserembeddings): this is what I'm using now. It uses subword information (via BPE), but it is not recommended for word embeddings because it is designed to embed sentences (https://github.com/facebookresearch/LASER/issues/69).
- BERT multilingual (bert-base-multilingual-uncased in https://huggingface.co/transformers/pretrained_models.html): it produces contextualised embeddings that can be used to embed sentences, and it does not seem good at embedding words without context.

Here is the problem I'm trying to solve:

I have a list of company names, which can be in any language (mainly English), and a list of English keywords. I want to measure how close a given company name is to the keywords. I currently have a simple keyword-matching solution, but I want to improve it using pre-trained embeddings. As the examples below show, there are several challenges:

- The keyword and the brand name are not separated by a space (I'm currently using the "wordsegment" package to split names into subwords), so embeddings with subword information should help a lot.
- The keyword list is not exhaustive and company names can be in different languages (that's why I want to use embeddings: "soccer" is close to "football").

Examples of company names: "cheapfootball ltd.", "wholesalefootball ltd.", "footballer ltd.", "soccershop ltd."

Examples of keywords: "football"

Upvotes: 6

Answers (2)

Gokul NC

Reputation: 1241

Check if this would do:

- Multilingual BPE-based embeddings
- Aligned multilingual sub-word vectors

If you're okay with whole word embeddings:
(Both of these are somewhat old, but I'm listing them here in case they help someone.)

If you're okay with contextual embeddings:

You can even try using the non-contextual input word embeddings (SentencePiece/WordPiece-tokenized) of multilingual transformer models like XLM-R or mBERT, instead of their contextual output embeddings. (I'm not sure how well this will perform.)
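As a rough illustration of the idea, here is a minimal sketch that mean-pools per-subword input vectors into one vector per word and compares words by cosine similarity. The table `TOY_SUBWORD_VECTORS` and its values are made up for illustration; in practice the vectors would come from the model's input embedding matrix (for example via `model.get_input_embeddings()` in Hugging Face transformers) and the subwords from the model's own tokenizer.

```python
import math

# Hypothetical toy table standing in for a model's non-contextual input
# embedding matrix. The values are invented purely for illustration.
TOY_SUBWORD_VECTORS = {
    "foot": [0.9, 0.1, 0.0],
    "ball": [0.8, 0.2, 0.1],
    "soccer": [0.8, 0.2, 0.0],
    "shop": [0.0, 0.1, 0.9],
    "cheap": [0.1, 0.9, 0.2],
}


def embed(subwords):
    """Mean-pool the vectors of the known subwords into a single vector."""
    vectors = [TOY_SUBWORD_VECTORS[s] for s in subwords if s in TOY_SUBWORD_VECTORS]
    if not vectors:
        raise ValueError("none of the subwords are in the embedding table")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


keyword = embed(["foot", "ball"])
print("football vs soccer:", cosine(keyword, embed(["soccer"])))
print("football vs shop:  ", cosine(keyword, embed(["shop"])))
```

Mean pooling is only one possible way to combine subword vectors; whether it gives useful similarities with real XLM-R or mBERT input embeddings would need to be checked on actual data.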

Upvotes: 3

Omar Saleem

Reputation: 94

From experience, I think relying on embeddings for this application can be a little misleading. If there are two companies, football ltd and soccer ltd, the model might say they are a match, which might not be right. One approach is to remove redundant words (e.g., "corporation" from "Facebook corporation", "ltd" from "Facebook ltd") and then try matching.
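A minimal sketch of that clean-up step, assuming a small hand-written suffix list (`LEGAL_SUFFIXES` is hypothetical and far from exhaustive) and plain substring matching as the follow-up comparison:

```python
import re

# Hypothetical, non-exhaustive list of legal-form suffixes to strip.
LEGAL_SUFFIXES = ("ltd", "limited", "corporation", "corp", "inc", "llc")

# Matches one trailing suffix, preceded by whitespace or a comma,
# with an optional final period.
_SUFFIX_RE = re.compile(
    r"[\s,]+(?:" + "|".join(LEGAL_SUFFIXES) + r")\.?$", re.IGNORECASE
)


def strip_legal_suffix(name):
    """Lower-case a company name and drop trailing legal-form words."""
    cleaned = name.strip()
    while True:  # repeat so stacked suffixes are all removed
        shorter = _SUFFIX_RE.sub("", cleaned)
        if shorter == cleaned:
            return cleaned.lower()
        cleaned = shorter


def contains_keyword(name, keyword):
    """Plain substring match against the cleaned company name."""
    return keyword.lower() in strip_legal_suffix(name)


for company in ["cheapfootball ltd.", "soccershop ltd.", "Facebook corporation"]:
    print(company, "->", strip_legal_suffix(company), contains_keyword(company, "football"))
```

Exact matching on the cleaned name keeps "football" and "soccer" apart, which avoids the false match described above, at the cost of missing synonyms.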

Another approach is to use deepmatcher, which performs deep-learning-based fuzzy matching using word context.

If sentence similarity is the primary approach you want to follow, the methods evaluated on STSBenchmark might be worth exploring.

Sent2vec and InferSent use fastText but seem to achieve good results on STSBenchmark.

Upvotes: 0
