user14251114

Reputation:

training a Fasttext model

I want to train a FastText model in Python using the "gensim" library. First, I tokenize each sentence into its words, converting each sentence to a list of words. Each of these lists is then appended to a final list, so at the end I have a nested list containing all the tokenized sentences:

import nltk

word_punctuation_tokenizer = nltk.WordPunctTokenizer()
word_tokenized_corpus = []
for line in open('sentences.txt'):
    new = line.strip()
    new = word_punctuation_tokenizer.tokenize(new)
    if len(new) != 0:
        word_tokenized_corpus.append(new)

Then, the model should be built as follows:

from gensim.models import FastText

embedding_size = 60
window_size = 40
min_word = 5
down_sampling = 1e-2
# note: in gensim >= 4.0 these parameters are named vector_size and epochs
ft_model = FastText(word_tokenized_corpus,
                    size=embedding_size,
                    window=window_size,
                    min_count=min_word,
                    sample=down_sampling,
                    sg=1,
                    iter=100)

However, the number of sentences in "word_tokenized_corpus" is very large and the program can't handle it. Can I instead train the model by feeding it the tokenized sentences one at a time, like this?

for line in open('sentences.txt'):
    new = line.strip()
    new = word_punctuation_tokenizer.tokenize(new)
    if len(new) != 0:
        ft_model = FastText(new,
                            size=embedding_size,
                            window=window_size,
                            min_count=min_word,
                            sample=down_sampling,
                            sg=1,
                            iter=100)

Does this make any difference to the final result? Is it possible to train the model without building such a large list and keeping it in memory?

Upvotes: 0

Views: 2691

Answers (2)

David Beauchemin

Reputation: 256

If you want to use the default fasttext API, here is how you can do it:

import fasttext

path = "path/to/all/the/texts/in/a/single/txt/files.txt"

training_param = {
    'ws': window_size,
    'minCount': min_word,
    'dim': embedding_size,
    't': down_sampling,
    'epoch': 5,
    'seed': 0
}
# for all the parameters: https://fasttext.cc/docs/en/options.html

model = fasttext.train_unsupervised(path, **training_param)
model.save_model("embeddings_300_fr.bin")

The advantages of using the fastText API are that (1) it is implemented in C++ with a Python wrapper, so it is much faster than gensim and multithreaded, and (2) it handles reading the text file itself. It can also be used directly from the command line.

Upvotes: 1

user14251114

Reputation:

Since the volume of the data is very high, it is better to stream the corpus from a text file instead of keeping the nested list in memory. The corpus_file argument accepts any plain-text file in LineSentence format (one sentence per line, tokens separated by whitespace):

from gensim.test.utils import datapath
# datapath() only resolves files inside gensim's bundled test-data folder;
# for your own corpus, pass its path directly, e.g. corpus_file = 'sentences.cor'
corpus_file = datapath('sentences.cor')

As for the next step:

model = FastText(size=embedding_size,
                 window=window_size,
                 min_count=min_word,
                 sample=down_sampling,
                 sg=1,
                 iter=100)
model.build_vocab(corpus_file=corpus_file)
total_words = model.corpus_total_words
# the epochs given here override the iter value passed to the constructor
model.train(corpus_file=corpus_file, total_words=total_words, epochs=5)

Upvotes: 2
