Skip-gram with Word2Vec not working properly

I am trying to build a word2vec similarity dictionary. I was able to train a model, but the similarities it returns are not what I expect. Am I missing something in my code?

Sample input text:

TAK PO LUN UNIT 3 15/F WAYSON COMMERCIAL G 28 CONNAUGHT RD WEST SHEUNG WAN
- EDDY SUSANTO YAHYA ROOM 1503-05 WESTERN CENTRE 40-50 DES VOEUX W. SHEUNG WAN
DNA FINANCIAL SYSTEMS INC UNIT 10 19F WAYSON COMMERCIAL 28 CONNAUGHT RD SHEUNG WAN
G/F 60 PO HING FONG SHEUNG WAN
10B CENTRAL MANSION 270 QUEENS RD CENTRAL SHEUNG WAN
AKAMAI INTERNATIONAL BV C/O IADVANTAGE 28/F OF MEGA I-ADVANTAGE 399 CHAI WAN RD CHAI WAN HONG KO HONG KONG
VICTORIA CHAN F/5E 1-3 FLEMING RD WANCHI WAN CHAI
HISTREND 365 5/F FOO TAK BUILDING 365 HENNESSY RD WAN CHAI H WAN CHAI
ROOM 1201 12F CHINACHEM JOHNSO PLAZA 178 186 JOHNSTON RD WAN CHAI
LUEN WO BUILDING 339 HENNESSY RD 9 FLOOR WAN CHAI HONG KONG

My code:

import gensim
from gensim import corpora, similarities, models

class AccCorpus(object):

    def __init__(self):
        self.path = ''

    def __iter__(self):
        # data is a pandas DataFrame; Adj_Addr holds the address strings
        for sentence in data["Adj_Addr"]:
            yield [word.lower() for word in sentence.split()]

def build_corpus():
    model = gensim.models.word2vec.Word2Vec(alpha=0.05, min_alpha=0.05, window=2, sg=1)
    sentences = AccCorpus()
    model.build_vocab(sentences)
    for epoch in range(1):
        model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)
        model.alpha -= 0.002  # decrease the learning rate
        model.min_alpha = model.alpha  # fix the learning rate, no decay

    model_name = "word2vec_model"
    model.save(model_name)
    return model

model = build_corpus()

My results:

model.most_similar("wan")
[('want', 0.6867533922195435),
 ('puiwan', 0.6323356032371521),
 ('wan.', 0.6132887005805969),
 ('wanstreet', 0.5945449471473694),
 ('aupuiwan', 0.594132661819458),
 ('futan', 0.5883135199546814),
 ('fotan', 0.5817855000495911),
 ('shanmei', 0.5807071924209595),
 ('30-33', 0.5789132118225098),
 ('61-63au', 0.5711270570755005)]

The outputs I expected for the similarity were district names such as sheungwan, wanchai, and chaiwan. I am guessing my skip-grams are not working properly. How can I fix this?

Upvotes: 1

Views: 647

Answers (1)

Maxim

Reputation: 53758

As already suggested in the comments, there's no need to tweak alpha and other internal parameters unless you're sure it's necessary (in your case, it most probably isn't).

You're getting a lot of extra results because those tokens are in your data somewhere. I don't know what Adj_Addr contains, but it's clearly more than just the text you provided: puiwan, futan, fotan, ... none of these appear in the sample above.
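One quick way to confirm this is to count the tokens that are actually fed to the model, since every word `most_similar()` can return must be in that vocabulary. A minimal sketch, using a stand-in list of raw strings in place of `data["Adj_Addr"]` and the same lowercase/split tokenization as `AccCorpus`:

```python
from collections import Counter

# Stand-in for data["Adj_Addr"]; in the real code this would be the
# pandas column of address strings.
raw_sentences = [
    "G/F 60 PO HING FONG SHEUNG WAN",
    "10B CENTRAL MANSION 270 QUEENS RD CENTRAL SHEUNG WAN",
]

# Same tokenization AccCorpus.__iter__ applies: lowercase + whitespace split.
tokens = Counter(
    word.lower() for sentence in raw_sentences for word in sentence.split()
)

# Any token word2vec reports must come from here; if 'puiwan' shows up
# in most_similar(), it is somewhere in the corpus too.
print(tokens["wan"])       # 2
print("puiwan" in tokens)  # False
```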

Here's a clean test that works just as you want it to (I kept only the relevant parts; feel free to add sg=1, it works as well):

import gensim

text = """TAK PO LUN UNIT 3 15/F WAYSON COMMERCIAL G 28 CONNAUGHT RD WEST SHEUNG WAN
- EDDY SUSANTO YAHYA ROOM 1503-05 WESTERN CENTRE 40-50 DES VOEUX W. SHEUNG WAN
DNA FINANCIAL SYSTEMS INC UNIT 10 19F WAYSON COMMERCIAL 28 CONNAUGHT RD SHEUNG WAN
G/F 60 PO HING FONG SHEUNG WAN
10B CENTRAL MANSION 270 QUEENS RD CENTRAL SHEUNG WAN
AKAMAI INTERNATIONAL BV C/O IADVANTAGE 28/F OF MEGA I-ADVANTAGE 399 CHAI WAN RD CHAI WAN HONG KO HONG KONG
VICTORIA CHAN F/5E 1-3 FLEMING RD WANCHI WAN CHAI
HISTREND 365 5/F FOO TAK BUILDING 365 HENNESSY RD WAN CHAI H WAN CHAI
ROOM 1201 12F CHINACHEM JOHNSO PLAZA 178 186 JOHNSTON RD WAN CHAI
LUEN WO BUILDING 339 HENNESSY RD 9 FLOOR WAN CHAI HONG KONG"""

sentences = text.split('\n')

class AccCorpus(object):
  def __init__(self):
    self.path = ''

  def __iter__(self):
    for sentence in sentences:
      yield [word.lower() for word in sentence.split()]

def build_corpus():
  model = gensim.models.word2vec.Word2Vec()
  sentences = AccCorpus()
  model.build_vocab(sentences)
  model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)
  return model

model = build_corpus()
print(model.most_similar("wan"))

The result is:

[('chai', 0.04687393456697464), ('rd', -0.03181878849864006), ('sheung', -0.06769674271345139)]

Upvotes: 2
