JarochoEngineer

Reputation: 1787

Word Embedding Model

I have been searching for and attempting to implement a word embedding model to predict similarity between words. I have a dataset made up of 3,550 company names; the idea is that the user can provide a new word (which would not be in the vocabulary) and calculate the similarity between the new name and the existing ones.

During preprocessing I removed stop words and punctuation (hyphens, dots, commas, etc.). In addition, I applied stemming and separated prefixes in the hope of getting more precision. As a result, a word such as BIOCHEMICAL ends up as BIO CHEMIC, i.e. the word split into its prefix and stemmed root.

The average company name is 3 words long, with the following frequency distribution:

[figure: frequency of company name lengths]

The tokens that result from preprocessing are then passed to word2vec:

# window: maximum distance between the current and predicted word within a sentence
# min_count: ignores all words with total frequency lower than this
# workers: use this many worker threads to train the model
# sg: the training algorithm, either CBOW (0) or skip-gram (1); default is 0
from gensim.models import Word2Vec

word2vec_model = Word2Vec(prepWords, size=300, window=2, min_count=1, workers=7, sg=1)

After the model has included all the words in the vocab, the average sentence vector is calculated for each company name:

df['avg_vector'] = df2.apply(lambda row: avg_sentence_vector(row, model=word2vec_model, num_features=300, index2word_set=set(word2vec_model.wv.index2word)).tolist())
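The avg_sentence_vector helper is not shown; a minimal sketch of what such a function usually looks like, with the signature taken from the call above and the body being an assumption:

import numpy as np

def avg_sentence_vector(words, model, num_features, index2word_set):
    # Average the vectors of the tokens that are in the word2vec vocabulary
    feature_vec = np.zeros((num_features,), dtype="float32")
    n_words = 0
    for word in words:
        if word in index2word_set:   # skip out-of-vocabulary tokens
            n_words += 1
            feature_vec = np.add(feature_vec, model.wv[word])
    if n_words > 0:
        feature_vec = np.divide(feature_vec, n_words)
    return feature_vec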

Then, the vector is saved for further lookups:

## Saving name and vector values to a file
df.to_csv('name-submission-vectors.csv',encoding='utf-8', index=False)

If a new company name is not included in the vocab after preprocessing (removing stop words and punctuation), then I rebuild the model, recalculate the average sentence vectors, and save them again.
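For completeness, the lookup step itself is not shown above; a possible sketch, assuming the CSV has a name column alongside avg_vector and reusing the helper from before:

import ast
import numpy as np
import pandas as pd

df = pd.read_csv('name-submission-vectors.csv', encoding='utf-8')

# Hypothetical new, preprocessed company name
new_tokens = ['pet', 'care']
new_vec = avg_sentence_vector(new_tokens, model=word2vec_model, num_features=300,
                              index2word_set=set(word2vec_model.wv.index2word))

def cosine_sim(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

# avg_vector was saved with .tolist(), so it comes back from the CSV as a string
df['similarity'] = df['avg_vector'].apply(lambda v: cosine_sim(ast.literal_eval(v), new_vec))
print(df.nlargest(10, 'similarity')[['name', 'similarity']])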

I have found this model is not working as expected. As an example, asking for the words most similar to pet gives the following results:

ms=word2vec_model.most_similar('pet')

('fastfood', 0.20879755914211273)
('hammer', 0.20450574159622192)
('allur', 0.20118337869644165)
('wright', 0.20001833140850067)
('daili', 0.1990675926208496)
('mgt', 0.1908089816570282)
('mcintosh', 0.18571510910987854)
('autopart', 0.1729743778705597)
('metamorphosi', 0.16965581476688385)
('doak', 0.16890916228294373)

In the dataset, I have words such as paws or petcare, yet unrelated words end up being the closest to pet.

This is the distribution of the nearest words to pet:

[figure: similarity distribution of the nearest words to pet]

On the other hand, when I used GoogleNews-vectors-negative300.bin.gz, I could not add new words to the vocab, but the similarity between pet and its surrounding words was as expected:

ms=word2vec_model.most_similar('pet')
('pets', 0.771199643611908)
('Pet', 0.723974347114563)
('dog', 0.7164785265922546)
('puppy', 0.6972636580467224)
('cat', 0.6891531348228455)
('cats', 0.6719794869422913)
('pooch', 0.6579219102859497)
('Pets', 0.636363685131073)
('animal', 0.6338439583778381)
('dogs', 0.6224827170372009)

This is the distribution of the nearest words:

[figure: similarity distribution of the nearest words to pet in the pretrained model]
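For reference, such a pretrained model is typically loaded with gensim's KeyedVectors (a minimal sketch; the local file path is an assumption):

from gensim.models import KeyedVectors

# Load the pretrained 300-dimensional Google News vectors (binary format)
pretrained = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin.gz', binary=True)

print(pretrained.most_similar('pet'))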

I would like to get your advice on the above.

Thanks

Upvotes: 1

Views: 831

Answers (3)

gojomo

Reputation: 54173

3500 texts (company names) of just ~3 words each is only around 10k total training words, with a much smaller vocabulary of unique words.

That's very, very small for word2vec & related algorithms, which rely on lots of data, and sufficiently-varied data, to train-up useful vector arrangements.

You may be able to squeeze some meaningful training from limited data by using far more training epochs than the default epochs=5, and far smaller vectors than the default size=100. With those sorts of adjustments, you may start to see more meaningful most_similar() results.
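For example (a minimal sketch; size and iter follow the gensim 3.x naming used in the question, gensim 4+ calls them vector_size and epochs, and the exact values need experimentation):

from gensim.models import Word2Vec

# Far smaller vectors and far more passes over the tiny corpus
word2vec_model = Word2Vec(prepWords,
                          size=32,      # much smaller than the default 100
                          window=2,
                          min_count=1,
                          workers=7,
                          sg=1,
                          iter=100)     # many more than the default 5 epochs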

But, it's unclear that word2vec, and specifically word2vec in your averaging-of-a-name's-words comparisons, is matched to your end goals.

Word2vec needs lots of data, doesn't look at subword units, and can't say anything about word-tokens not seen during training. An average-of-many-word-vectors can often work as an easy baseline for comparing multiword texts, but might also dilute any single word's influence compared to other methods.

Things to consider might include:

  • Word2vec-related algorithms like FastText that also learn vectors for subword units, and can thus bootstrap not-so-bad guess vectors for words not seen in training. (But these are also data-hungry, and to use them on a small dataset you'd again want to reduce the vector size, increase the epochs, and additionally shrink the number of buckets used for subword learning; see the sketch after this list.)

  • More sophisticated comparisons of multi-word texts, like "Word Mover's Distance". (That can be quite expensive on longer texts, but for names/titles of just a few words may be practical.)

  • Finding more data that's compatible with your aims for a stronger model. A larger database of company names might help. If you just want your analysis to understand English words/roots, more generic training texts might work too.

  • For many purposes, a mere lexicographic comparison (edit distance, count of shared character n-grams) may be helpful too, though it won't detect all synonyms/semantically-similar words.
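A rough sketch of the first option in gensim (the parameter values are only guesses for a corpus this small; size and iter follow the gensim 3.x API used in the question, gensim 4+ uses vector_size and epochs):

from gensim.models import FastText

# Subword-aware model: small vectors, many epochs, far fewer hash buckets
ft_model = FastText(prepWords,
                    size=32,           # default is 100
                    window=2,
                    min_count=1,
                    sg=1,
                    iter=100,          # default is 5
                    min_n=3, max_n=5,  # character n-gram lengths
                    bucket=20000)      # default is 2,000,000

# FastText can synthesize a vector even for a token never seen in training
print(ft_model.wv.most_similar('petcare'))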

Upvotes: 1

If you only want to compute similarity between words, you probably don't need to insert new words into your vocabulary.

At first glance, I think you can also use FastText without needing to stem the words; it also computes vectors for unknown words.

From FastText FAQ:

One of the key features of fastText word representation is its ability to produce vectors for any words, even made-up ones. Indeed, fastText word vectors are built from vectors of substrings of characters contained in it. This allows to build vectors even for misspelled words or concatenation of words.

FastText seems to be useful for your purpose. For your task, you can follow the FastText supervised tutorial.

If your corpus proves to be too small, you can build your model starting from available pretrained vectors (pretrainedVectors parameter).
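A minimal sketch with the official fasttext Python bindings, assuming a file with one preprocessed company name per line (the pretrainedVectors option mentioned above belongs to train_supervised, which needs labelled data):

import fasttext

# Subword skip-gram model on the company-name corpus (hypothetical file name)
model = fasttext.train_unsupervised('company_names.txt',
                                    model='skipgram',
                                    dim=100,
                                    epoch=100,
                                    minn=3,
                                    maxn=6)

# fastText builds vectors from character n-grams, so even unseen or
# misspelled words get a vector
print(model.get_word_vector('biochemical'))
print(model.get_nearest_neighbors('petcare'))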

Upvotes: 1

Has QUIT--Anony-Mousse

Reputation: 77454

Word2vec does not generalize to unseen words.

It does not even work well for words that are seen but rare. It really depends on having many, many examples of word usage. Furthermore, you need enough context to the left and right, but you only use company names, and these are too short. That is likely why your embeddings perform so poorly: too little data and too-short texts.

Hence, it is the wrong approach for you. Retraining the model with the new company name is not enough; you still only have one data point. You may as well leave out unseen words, since word2vec cannot do better than that even if you retrain.

Upvotes: 1
