Reputation: 311
Recently, I trained FastText word embeddings on sentiment140 to get representations for English words. However, today, just as a trial, I ran the FastText model on a couple of Chinese words, for instance:
import os
import gensim.models as gs
path = r'\data\word2vec'
w2v = gs.FastText.load(os.path.join(path, 'fasttext_model'))
w2v.wv['哈哈哈哈']
It outputs:
array([ 0.00303676, 0.02088235, -0.00815559, 0.00484574, -0.03576371,
-0.02178247, -0.05090654, 0.03063928, -0.05999983, 0.04547168,
-0.01778449, -0.02716631, -0.03326027, -0.00078981, 0.0168153 ,
0.00773436, 0.01966593, -0.00756055, 0.02175765, -0.0050137 ,
0.00241255, -0.03810823, -0.03386266, 0.01231019, -0.00621936,
-0.00252419, 0.02280569, 0.00992453, 0.02770403, 0.00233192,
0.0008545 , -0.01462698, 0.00454278, 0.0381292 , -0.02945416,
-0.00305543, -0.00690968, 0.00144188, 0.00424266, 0.00391074,
0.01969502, 0.02517333, 0.00875261, 0.02937791, 0.03234404,
-0.01116276, -0.00362578, 0.00483239, -0.02257918, 0.00123061,
0.00324584, 0.00432153, 0.01332884, 0.03186348, -0.04119627,
0.01329033, 0.01382102, -0.01637722, 0.01464139, 0.02203292,
0.0312229 , 0.00636201, -0.00044287, -0.00489291, 0.0210293 ,
-0.00379244, -0.01577058, 0.02185207, 0.02576622, -0.0054543 ,
-0.03115215, -0.00337738, -0.01589811, -0.01608399, -0.0141606 ,
0.0508234 , 0.00775024, 0.00352813, 0.00573649, -0.02131752,
0.01166397, 0.00940598, 0.04075769, -0.04704212, 0.0101376 ,
0.01208556, 0.00402935, 0.0093914 , 0.00136144, 0.03284211,
0.01000613, -0.00563702, 0.00847146, 0.03236216, -0.01626745,
0.04095127, 0.02858841, 0.0248084 , 0.00455458, 0.01467448],
dtype=float32)
Hence, I really want to know why a FastText model trained only on sentiment140 can do this. Thank you!
Upvotes: 1
Views: 1330
Reputation: 54203
In fact, the proper behavior for a FastText model, matching Facebook's original/reference implementation, is to always return a vector for an out-of-vocabulary word.
Essentially, if none of the supplied string's character n-grams are present, a vector will still be synthesized from whatever random vectors happen to be at the same lookup slots in the model's fixed-size collection of n-gram vectors.
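To make that concrete, here is a simplified sketch of the idea. This is an illustration of FastText-style n-gram hashing, not Gensim's exact internals; the hash constants follow the FNV-1a scheme the reference implementation uses, and min_n=3, max_n=6, buckets=2000000 are the usual defaults:

# Illustrative sketch, not Gensim's API: every character n-gram is
# hashed into a fixed-size bucket table, so any string -- seen in
# training or not -- resolves to some slots, and hence some vector.
def ft_hash(ngram):
    h = 2166136261                               # FNV-1a offset basis
    for b in ngram.encode('utf-8'):
        h = ((h ^ b) * 16777619) & 0xFFFFFFFF    # FNV-1a prime, 32-bit
    return h

def ngram_slots(word, min_n=3, max_n=6, buckets=2000000):
    padded = '<' + word + '>'                    # FastText pads words with < >
    return [ft_hash(padded[i:i + n]) % buckets
            for n in range(min_n, max_n + 1)
            for i in range(len(padded) - n + 1)]

print(ngram_slots('哈哈哈哈'))                    # valid bucket indices either way

Whatever random (or coincidentally-trained) vectors sit at those slots get combined into the returned vector.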
In Gensim up through at least 3.7.1, the FastText class will throw a KeyError: 'all ngrams for word _____ absent from model' if none of an out-of-vocabulary word's n-grams are present. But that's buggy behavior that will be reversed, to match Facebook's FastText, in a future Gensim release. (The PR to correct this behavior has been merged to Gensim's develop branch and thus should take effect in the next release after 3.7.1.)
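Until that release, one way to cope is a guarded lookup. A minimal sketch, assuming you'd rather get None for a fully-unseen string than an exception:

def safe_lookup(wv, word):
    try:
        return wv[word]
    except KeyError:   # Gensim <= 3.7.1: no n-gram of `word` in the model
        return None

vec = safe_lookup(w2v.wv, '哈哈哈哈')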
I'm not sure why you're not getting such an error with the specific model and dataset you've described. Perhaps your fasttext_model was actually trained with different text than you think? Or trained with a very small, non-default min_n parameter, such that a single 哈 appearing inside the sentiment140 data is enough to contribute to a synthesized vector for 哈哈哈哈?
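One way to check the second possibility is to inspect the n-gram settings stored with the model. A minimal sketch, assuming Gensim 3.x, where these live on the keyed vectors (the exact attribute location can vary across versions, so verify against your install):

print('min_n:', w2v.wv.min_n, 'max_n:', w2v.wv.max_n)   # defaults are 3 and 6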
But given that the standard FastText behavior is to always report some synthesized vector, and that Gensim will match that behavior in a future release, you shouldn't count on getting an error here. For completely unknown words that bear no resemblance to the training data, expect to get back an essentially random vector.
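If it matters whether a vector reflects real training exposure rather than pure n-gram-bucket synthesis, check vocabulary membership explicitly. A minimal sketch, assuming Gensim 3.x attribute names (Gensim 4.x moved the vocabulary to wv.key_to_index):

word = '哈哈哈哈'
if word in w2v.wv.vocab:   # word was learned during training
    print('in-vocabulary vector:', w2v.wv[word][:5])
else:                      # vector is synthesized from n-gram buckets only
    print('OOV: treat the returned vector as essentially random')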
Upvotes: 3