Reputation: 423
Are there any recent pre-trained multilingual word embeddings (i.e., with multiple languages jointly mapped to the same vector space)?
I have looked at the following but they don't fit my needs:
Here is the problem I'm trying to solve:
I have a list of company names, which can be in any language (mainly English), and a list of English keywords. I want to measure how close a given company name is to the keywords. I currently have a simple keyword-matching solution, but I want to improve it using pre-trained embeddings. As you can see in the following examples, there are several challenges:
Examples of company names: "cheapfootball ltd.", "wholesalefootball ltd.", "footballer ltd.", "soccershop ltd."
Examples of keywords: "football"
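For reference, the keyword-matching baseline can be sketched as below. The question does not show the actual implementation, so this assumes a case-insensitive substring match, and the function name `keyword_hits` is hypothetical. It illustrates one of the challenges: compound names containing the keyword are caught, but a related name such as "soccershop ltd." is missed.

```python
def keyword_hits(company_names, keywords):
    """Map each company name to the keywords it contains.

    Assumption: matching is a plain case-insensitive substring test.
    """
    hits = {}
    for name in company_names:
        lowered = name.lower()
        hits[name] = [kw for kw in keywords if kw.lower() in lowered]
    return hits


names = [
    "cheapfootball ltd.",
    "wholesalefootball ltd.",
    "footballer ltd.",
    "soccershop ltd.",
]
result = keyword_hits(names, ["football"])

for name, matched in result.items():
    print(name, "->", matched)
```

The first three names match "football" as a substring, while "soccershop ltd." gets no match even though it is semantically related, which is the gap embeddings are meant to close.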
Upvotes: 6
Views: 7842
Reputation: 1241
Check if this would do:
If you're okay with whole word embeddings:
(Both of these are somewhat old, but I'm putting them here in case they help someone.)
If you're okay with contextual embeddings:
You can also try using the non-contextual input word embeddings (over subword tokens: SentencePiece for XLM-R, WordPiece for mBERT) of multilingual transformer models such as XLM-R or mBERT, instead of their contextual output embeddings. (I'm not sure how well this will perform.)
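A minimal sketch of the input-embedding idea, assuming the Hugging Face `transformers` library (with PyTorch) and the `xlm-roberta-base` checkpoint. The helper names (`load_xlmr`, `static_embedding`, `mean_pool`, `cosine`) are hypothetical, and mean pooling over subword pieces is just one simple choice, not something this answer has tested.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def static_embedding(text, tokenizer, table):
    """Mean of the input (non-contextual) embeddings of the subword pieces.

    `tokenizer` returns a dict with "input_ids"; `table` is indexable by
    token id and yields a row of numbers. Assumes `text` is non-empty.
    """
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    return mean_pool([[float(x) for x in table[i]] for i in ids])


def load_xlmr(name="xlm-roberta-base"):
    """Load the tokenizer and the input embedding matrix (downloads the model)."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)
    table = model.get_input_embeddings().weight.detach()
    return tokenizer, table


# Example usage (needs network access and the transformers package):
#   tokenizer, table = load_xlmr()
#   a = static_embedding("soccershop", tokenizer, table)
#   b = static_embedding("football", tokenizer, table)
#   print(cosine(a, b))
```

No transformer layers are run here, only an embedding lookup, so this is cheap. The trade-off is that the vectors carry no context, which is why the quality is uncertain.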
Upvotes: 3
Reputation: 94
I think it might be a little misleading to build a model using embeddings for this application (learned from experience). If there are two companies, football ltd and soccer ltd, the model might say both are a match, which might not be right. One approach is to remove redundant words, e.g., "corporation" from "Facebook corporation" or "ltd" from "Facebook ltd", and then try matching.
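As a rough sketch of the word-removal step, assuming the redundant words are legal-form suffixes: the `LEGAL_FORMS` list below is a hypothetical, non-exhaustive example, and `strip_legal_forms` is an illustrative name rather than a function from any library.

```python
import re

# Hypothetical, non-exhaustive list of legal-form words to drop.
LEGAL_FORMS = ("ltd", "limited", "inc", "corp", "corporation", "llc", "gmbh")

# Match any listed word as a whole word, plus an optional trailing period.
_LEGAL_PATTERN = re.compile(
    r"\b(?:" + "|".join(LEGAL_FORMS) + r")\b\.?", re.IGNORECASE
)


def strip_legal_forms(name):
    """Remove legal-form words, collapse whitespace, and lowercase the name."""
    cleaned = _LEGAL_PATTERN.sub(" ", name)
    return " ".join(cleaned.split()).lower()


for raw in ["Facebook corporation", "Facebook ltd", "soccershop ltd."]:
    print(raw, "->", strip_legal_forms(raw))
```

After this normalization, "Facebook corporation" and "Facebook ltd" both reduce to "facebook", so an exact or fuzzy comparison operates only on the distinctive part of the name.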
Another approach is to use deepmatcher, which performs deep-learning-based fuzzy matching using word context. Link
If sentence similarity is the primary approach you want to follow, models evaluated on STSBenchmark might be worth exploring: Link
Sent2vec (link) and InferSent (link) use fastText but seem to achieve good results on STSBenchmark.
Upvotes: 0