Perl Del Rey

Reputation: 1059

Difference between Gensim's FastText and Facebook's FastText

I recently learned that the original FastText implementation (here) provides fasttext.train_unsupervised for generating word vectors (see this link for an example). It turns out that Gensim also supports FastText, with an API similar to that of its word2vec (see the example here).

Is there a difference between the two implementations? The documentation is not clear, but do both follow the paper Enriching Word Vectors with Subword Information? And if so, why would one use Gensim's FastText over Facebook's fasttext?

Upvotes: 3

Views: 1498

Answers (2)

gojomo

Reputation: 54173

Gensim intends to match the Facebook implementation, but with a few known or intentional differences. Specifically, Gensim doesn't implement:

  • the -supervised option, & specific-to-that-mode autotuning/quantization/pretrained-vectors options
  • word-multigrams (as controlled by the -wordNgrams parameter to fasttext)
  • the plain softmax option for loss-optimization

Regarding options to -loss, I'm relatively sure that despite Facebook's command-line options docs indicating that the fasttext default is softmax, it is actually ns except when in -supervised mode, just like word2vec.c & Gensim. See for example this source code.

I suspect a future contribution to Gensim that adds wordNgrams support would be welcome, if that mode is useful to some users, and to match the reference implementation.

So far Gensim has chosen to avoid supervised algorithms, so the -supervised mode is less likely to appear in any future Gensim. (I'd argue for it, though, if a working implementation were contributed.)

The plain softmax mode is so much slower on typical large output vocabularies that few non-academic projects would want to use it over hs or ns. (It may still be practical with a smaller number of output labels, as in -supervised mode, though.)
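To make the differences above concrete, here is a minimal sketch of how a few Facebook fasttext options might be translated into keyword arguments for gensim.models.FastText. The helper to_gensim_kwargs and its table are hypothetical, written only for illustration. The keyword names assume Gensim 4.x (vector_size, epochs), and the fallback values for -loss and -neg are assumptions based on the discussion above, not something taken from either library's code.

```python
# Hypothetical helper (not part of Gensim or fastText): translate a few
# Facebook fasttext options into gensim.models.FastText keyword arguments.
# Keyword names assume Gensim 4.x.

# Options that map one-to-one onto a Gensim keyword.
_DIRECT = {
    "dim": "vector_size",
    "ws": "window",
    "minCount": "min_count",
    "minn": "min_n",
    "maxn": "max_n",
    "epoch": "epochs",
    "lr": "alpha",
    "bucket": "bucket",
}


def to_gensim_kwargs(fb_options):
    """Return Gensim kwargs, or raise ValueError for unsupported modes."""
    opts = dict(fb_options)

    model = opts.pop("model", "skipgram")
    if model == "supervised":
        raise ValueError("Gensim does not implement -supervised mode")
    if model not in ("skipgram", "cbow"):
        raise ValueError(f"unknown model: {model!r}")

    if opts.pop("wordNgrams", 1) != 1:
        raise ValueError("Gensim only handles word unigrams (wordNgrams=1)")

    # Assumption: treat 'ns' with 5 negatives as the unsupervised default.
    loss = opts.pop("loss", "ns")
    neg = opts.pop("neg", 5)

    kwargs = {"sg": 1 if model == "skipgram" else 0}
    if loss == "ns":
        kwargs.update(hs=0, negative=neg)
    elif loss == "hs":
        kwargs.update(hs=1, negative=0)
    else:
        raise ValueError(f"loss {loss!r} is not available in Gensim")

    for name, value in opts.items():
        if name not in _DIRECT:
            raise ValueError(f"no known Gensim equivalent for {name!r}")
        kwargs[_DIRECT[name]] = value
    return kwargs
```

The resulting dictionary could then be passed along as FastText(**kwargs) before building the vocabulary and training. In Gensim 3.x the same settings were named size and iter rather than vector_size and epochs.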

Upvotes: 2

Perl Del Rey

Reputation: 1059

I found one difference in Gensim's documentation:

word_ngrams (int, optional) – In Facebook’s FastText, “max length of word ngram” -
but gensim only supports the default of 1 (regular unigram word handling).

This means that Gensim supports only word unigrams, not word bigrams or trigrams.
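Note that this limitation concerns word n-grams, which are separate from the character n-grams (subwords) described in the paper and controlled in Gensim by min_n and max_n. The sketch below illustrates the distinction in plain Python. Both function names are made up for this example, and the real implementations differ in detail (for instance, they hash n-grams into a fixed number of buckets), so treat it as a simplification.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def word_ngrams(tokens, n):
    """Word n-grams: runs of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Subword units for one word, as in the paper's "where" example (n = 3):
print(char_ngrams("where", 3, 3))   # ['<wh', 'whe', 'her', 'ere', 're>']

# Word bigrams over a token list (what -wordNgrams 2 would add):
print(word_ngrams(["new", "york", "city"], 2))   # ['new york', 'york city']
```

Character n-grams are what let FastText build vectors for out-of-vocabulary words. Word n-grams add features for multi-word sequences, and that is the part Gensim leaves out.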

Upvotes: 3
