Reputation: 103
I am trying to build a word2vec similarity dictionary. I was able to build the model, but the similarities it returns are not what I expect. Am I missing anything in my code?
Sample input text:
TAK PO LUN UNIT 3 15/F WAYSON COMMERCIAL G 28 CONNAUGHT RD WEST SHEUNG WAN
- EDDY SUSANTO YAHYA ROOM 1503-05 WESTERN CENTRE 40-50 DES VOEUX W. SHEUNG WAN
DNA FINANCIAL SYSTEMS INC UNIT 10 19F WAYSON COMMERCIAL 28 CONNAUGHT RD SHEUNG WAN
G/F 60 PO HING FONG SHEUNG WAN
10B CENTRAL MANSION 270 QUEENS RD CENTRAL SHEUNG WAN
AKAMAI INTERNATIONAL BV C/O IADVANTAGE 28/F OF MEGA I-ADVANTAGE 399 CHAI WAN RD CHAI WAN HONG KO HONG KONG
VICTORIA CHAN F/5E 1-3 FLEMING RD WANCHI WAN CHAI
HISTREND 365 5/F FOO TAK BUILDING 365 HENNESSY RD WAN CHAI H WAN CHAI
ROOM 1201 12F CHINACHEM JOHNSO PLAZA 178 186 JOHNSTON RD WAN CHAI
LUEN WO BUILDING 339 HENNESSY RD 9 FLOOR WAN CHAI HONG KONG
My code:
import gensim
from gensim import corpora, similarities, models

class AccCorpus(object):
    def __init__(self):
        self.path = ''

    def __iter__(self):
        for sentence in data["Adj_Addr"]:
            yield [word.lower() for word in sentence.split()]

def build_corpus():
    model = gensim.models.word2vec.Word2Vec(alpha=0.05, min_alpha=0.05, window=2, sg=1)
    sentences = AccCorpus()
    model.build_vocab(sentences)
    for epoch in range(1):
        model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)
        model.alpha -= 0.002  # decrease the learning rate
        model.min_alpha = model.alpha  # fix the learning rate, no decay
    model_name = "word2vec_model"
    model.save(model_name)
    return model

model = build_corpus()
My results:
model.most_similar("wan")
[('want', 0.6867533922195435),
('puiwan', 0.6323356032371521),
('wan.', 0.6132887005805969),
('wanstreet', 0.5945449471473694),
('aupuiwan', 0.594132661819458),
('futan', 0.5883135199546814),
('fotan', 0.5817855000495911),
('shanmei', 0.5807071924209595),
('30-33', 0.5789132118225098),
('61-63au', 0.5711270570755005)]
The outputs I expected to see among the most similar words are: sheungwan, wanchai, chaiwan. I am guessing my skip-grams are not working properly. How can I fix this?
Upvotes: 1
Views: 647
Reputation: 53758
As already suggested in the comments, there's no need to tweak alpha and other internal parameters unless you're sure it's necessary (in your case it most probably isn't).
You're getting a lot of extra results because they are in your data somewhere. I don't know what Adj_Addr contains, but it's clearly not just the text you provided: puiwan, futan, fotan, ... none of these tokens appears in the text above.
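A quick way to confirm that these stray tokens come from the data itself is to count them in the raw corpus before training. This is a minimal sketch in plain Python (no gensim needed); the `rows` list stands in for your `data["Adj_Addr"]` column, here truncated to two of the sample lines:

```python
from collections import Counter

# Stand-in for data["Adj_Addr"]; replace with your real column.
rows = [
    "TAK PO LUN UNIT 3 15/F WAYSON COMMERCIAL G 28 CONNAUGHT RD WEST SHEUNG WAN",
    "G/F 60 PO HING FONG SHEUNG WAN",
]

# Tokenize exactly the way the corpus iterator does: lowercase + split().
token_counts = Counter(
    word.lower() for sentence in rows for word in sentence.split()
)

for suspect in ("puiwan", "futan", "fotan"):
    # 0 on this sample text; a non-zero count in your real data would
    # confirm that these words are being trained on.
    print(suspect, token_counts[suspect])
```

If the counts are non-zero on the full column, the "extra" neighbours are simply words the model legitimately learned from your data.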
Here's a clean test that works just as you want it to (I kept only the relevant parts; feel free to add sg=1, which works as well):
import gensim

text = """TAK PO LUN UNIT 3 15/F WAYSON COMMERCIAL G 28 CONNAUGHT RD WEST SHEUNG WAN
- EDDY SUSANTO YAHYA ROOM 1503-05 WESTERN CENTRE 40-50 DES VOEUX W. SHEUNG WAN
DNA FINANCIAL SYSTEMS INC UNIT 10 19F WAYSON COMMERCIAL 28 CONNAUGHT RD SHEUNG WAN
G/F 60 PO HING FONG SHEUNG WAN
10B CENTRAL MANSION 270 QUEENS RD CENTRAL SHEUNG WAN
AKAMAI INTERNATIONAL BV C/O IADVANTAGE 28/F OF MEGA I-ADVANTAGE 399 CHAI WAN RD CHAI WAN HONG KO HONG KONG
VICTORIA CHAN F/5E 1-3 FLEMING RD WANCHI WAN CHAI
HISTREND 365 5/F FOO TAK BUILDING 365 HENNESSY RD WAN CHAI H WAN CHAI
ROOM 1201 12F CHINACHEM JOHNSO PLAZA 178 186 JOHNSTON RD WAN CHAI
LUEN WO BUILDING 339 HENNESSY RD 9 FLOOR WAN CHAI HONG KONG"""

sentences = text.split('\n')

class AccCorpus(object):
    def __init__(self):
        self.path = ''

    def __iter__(self):
        for sentence in sentences:
            yield [word.lower() for word in sentence.split()]

def build_corpus():
    model = gensim.models.word2vec.Word2Vec()
    sentences = AccCorpus()
    model.build_vocab(sentences)
    model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)
    return model

model = build_corpus()
print(model.most_similar("wan"))
The result is:
[('chai', 0.04687393456697464), ('rd', -0.03181878849864006), ('sheung', -0.06769674271345139)]
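One more note on the expected outputs: sheungwan, wanchai and chaiwan are joined bigrams, and word2vec will never produce them from space-separated tokens because they are not single words in the corpus. You would need to merge frequent bigrams into one token before training; gensim's `Phrases` model is the usual tool for that. Below is a simplified hand-rolled stand-in, just to illustrate the idea (the `BIGRAMS` set is a hypothetical hand-picked list, not something gensim provides):

```python
# Simplified stand-in for gensim.models.Phrases: merge a hand-picked set
# of district-name bigrams into single tokens before feeding Word2Vec.
BIGRAMS = {("sheung", "wan"), ("wan", "chai"), ("chai", "wan")}

def merge_bigrams(tokens):
    out, i = [], 0
    while i < len(tokens):
        # If the current pair is a known bigram, emit it as one token.
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in BIGRAMS:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(merge_bigrams("g/f 60 po hing fong sheung wan".split()))
# ['g/f', '60', 'po', 'hing', 'fong', 'sheungwan']
```

Applying a step like this inside the corpus iterator would put tokens such as sheungwan into the vocabulary, so they could then appear in most_similar results.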
Upvotes: 2