Reputation: 1642
I would like to plot, in a simple vector-space graph, the similarity between different words. I have calculated the similarities with the word2vec model
provided by gensim, but I cannot find any graphical examples in the literature. My code is as follows:
## Libraries to download
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
import json
import nltk
import re
import pandas
appended_data = []
#for i in range(20014,2016):
#    df0 = pandas.DataFrame([json.loads(l) for l in open('SDM_%d.json' % i)])
#    appended_data.append(df0)
for i in range(2005,2016):
    if i > 2013:
        df0 = pandas.DataFrame([json.loads(l) for l in open('SDM_%d.json' % i)])
        appended_data.append(df0)
    df1 = pandas.DataFrame([json.loads(l) for l in open('Scot_%d.json' % i)])
    df2 = pandas.DataFrame([json.loads(l) for l in open('APJ_%d.json' % i)])
    df3 = pandas.DataFrame([json.loads(l) for l in open('TH500_%d.json' % i)])
    df4 = pandas.DataFrame([json.loads(l) for l in open('DRSM_%d.json' % i)])
    appended_data.append(df1)
    appended_data.append(df2)
    appended_data.append(df3)
    appended_data.append(df4)
appended_data = pandas.concat(appended_data)
# doc_set = df1.body
doc_set = appended_data.body
## Building the deep learning model
import itertools
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
sentenized = doc_set.apply(sent_detector.tokenize)
sentences = itertools.chain.from_iterable(sentenized.tolist()) # just to flatten
from gensim.models import word2vec
result = []
for sent in sentences:
    result += [nltk.word_tokenize(sent)]
model = gensim.models.Word2Vec(result)
In a simple vector-space graph, I would like to place the following words: bank, finance, market, property, oil, energy, business and economy. I can easily calculate the similarity of any pair of these words with the function:
model.similarity('bank', 'property')
0.25089364531360675
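`model.similarity` returns the cosine similarity of the two word vectors, so for those eight words you can collect all pairwise values into one labelled matrix and inspect it (or later feed it to a heatmap). A minimal sketch, where random vectors stand in for the trained model's `model[w]` lookups since the corpus isn't available here:

```python
import numpy as np
import pandas

words = ['bank', 'finance', 'market', 'property',
         'oil', 'energy', 'business', 'economy']

# Stand-in vectors; with a real model you would use
# vectors = np.array([model[w] for w in words])
rng = np.random.RandomState(0)
vectors = rng.rand(len(words), 100)

# Cosine similarity: normalise each row, then take dot products
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
sim = pandas.DataFrame(unit @ unit.T, index=words, columns=words)

print(sim.loc['bank', 'property'])
```

The resulting DataFrame is symmetric with ones on the diagonal, matching what repeated `model.similarity(a, b)` calls would give.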
Thanks a lot
Upvotes: 2
Views: 4401
Reputation: 66
To plot all the word vectors in your Word2Vec model, you need to perform dimensionality reduction. You can use t-SNE (t-distributed Stochastic Neighbor Embedding) from Python's sklearn to visualise the multi-dimensional vectors in 2-D space.
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, random_state=0)  # project the vectors down to 2 dimensions
all_vector_matrix = model.syn0  # model.wv.syn0 in newer gensim versions
all_vector_matrix_2d = tsne.fit_transform(all_vector_matrix)
This gives you a 2-D coordinate for every word in the vocabulary, which you can wrap in a pandas DataFrame and then plot using seaborn or matplotlib's pyplot.
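To place just the eight words from the question, you can reduce their vectors to 2-D and label each point with `annotate`. A minimal sketch; the random vectors below stand in for `np.array([model[w] for w in words])` since no trained model is available here, and `perplexity` is lowered because t-SNE requires it to be smaller than the number of samples:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib
matplotlib.use('Agg')  # headless backend so this also runs without a display
import matplotlib.pyplot as plt

words = ['bank', 'finance', 'market', 'property',
         'oil', 'energy', 'business', 'economy']

# Stand-in for a trained model; replace with np.array([model[w] for w in words])
rng = np.random.RandomState(0)
vectors = rng.rand(len(words), 100)

# perplexity must be < number of samples, hence the small value for 8 words
tsne = TSNE(n_components=2, random_state=0, perplexity=5)
coords = tsne.fit_transform(vectors)

fig, ax = plt.subplots()
ax.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    ax.annotate(word, (x, y))
fig.savefig('word_map.png')

print(coords.shape)  # (8, 2)
```

With a real model the relative positions of the labelled points then reflect the similarities the question computes with `model.similarity`.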
Upvotes: 5