Economist_Ayahuasca

Reputation: 1642

Graphical plot of words similarity given by Word2Vec

I would like to plot the similarity between different words in a simple vector space graph. I have calculated the similarities using the word2vec model provided by gensim, but I cannot find any graphical examples in the literature. My code is as follows:

## Libraries to download
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim

import json
import nltk
import re
import pandas


appended_data = []


#for i in range(20014,2016):
#    df0 = pandas.DataFrame([json.loads(l) for l in open('SDM_%d.json' % i)])
#    appended_data.append(df0)

for i in range(2005,2016):
    if i > 2013:
        df0 = pandas.DataFrame([json.loads(l) for l in open('SDM_%d.json' % i)])
        appended_data.append(df0)
    df1 = pandas.DataFrame([json.loads(l) for l in open('Scot_%d.json' % i)])
    df2 = pandas.DataFrame([json.loads(l) for l in open('APJ_%d.json' % i)])
    df3 = pandas.DataFrame([json.loads(l) for l in open('TH500_%d.json' % i)])
    df4 = pandas.DataFrame([json.loads(l) for l in open('DRSM_%d.json' % i)])
    appended_data.append(df1)
    appended_data.append(df2)
    appended_data.append(df3)
    appended_data.append(df4)


appended_data = pandas.concat(appended_data)
# doc_set = df1.body

doc_set = appended_data.body

## Building the deep learning model
import itertools

sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
sentenized = doc_set.apply(sent_detector.tokenize)
sentences = itertools.chain.from_iterable(sentenized.tolist()) # just to flatten

from gensim.models import word2vec


result = []
for sent in sentences:
    result += [nltk.word_tokenize(sent)]

model = gensim.models.Word2Vec(result)

In a simple vector space graph, I would like to place the following words: bank, finance, market, property, oil, energy, business and economy. I can easily calculate the similarity of any pair of these words with the function:

model.similarity('bank', 'property')
0.25089364531360675
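
For reference, the number returned by model.similarity is the cosine similarity between the two word vectors. A minimal sketch of that calculation in plain Python follows; the function name cosine_similarity and the toy 2-D vectors are made up for illustration and are not taken from the trained model:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (hypothetical, not real word2vec output)
vec_a = [1.0, 0.0]
vec_b = [1.0, 1.0]
vec_c = [0.0, 2.0]

sim_ab = cosine_similarity(vec_a, vec_b)  # vectors 45 degrees apart
sim_ac = cosine_similarity(vec_a, vec_c)  # orthogonal vectors
```

Values close to 1 mean the vectors point in nearly the same direction, and values near 0 mean they are roughly unrelated.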

Thanks a lot

Upvotes: 2

Views: 4401

Answers (1)

Vyom Sharma

Reputation: 66

To plot the word vectors in your Word2Vec model, you need to perform dimensionality reduction. You can use TSNE (t-distributed Stochastic Neighbor Embedding) from Python's sklearn to visualise the multi-dimensional vectors in 2-D space.

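from sklearn.manifold import TSNE

# n_components is the number of output dimensions; use 2 for a 2-D plot
tsne = TSNE(n_components=2, random_state=0)
# model.syn0 is the word-vector matrix in older gensim releases;
# newer releases expose it as model.wv.vectors
all_vector_matrix = model.syn0
all_vector_matrix_2d = tsne.fit_transform(all_vector_matrix)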

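This gives you a matrix with one row of 2-D coordinates per word in the vocabulary (it is a set of positions, not a similarity matrix); words with similar vectors tend to end up close to each other. You can load it into pandas and then plot it using seaborn or matplotlib's pyplot.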
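
As a rough sketch of the plotting step for just the eight words in the question, something like the following could work. The helper names project_words and plot_words are hypothetical, and the random placeholder vectors are only there so the example runs without a trained model; with a real model you would presumably build the dictionary from the model itself, e.g. model.wv[word] in gensim 1.0 and later, or model[word] in older releases. Note that t-SNE's perplexity has to be smaller than the number of points, so it is set low here.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def project_words(vectors, words, perplexity=3, seed=0):
    """Return an array of shape (len(words), 2) with t-SNE coordinates."""
    matrix = np.array([vectors[w] for w in words], dtype=float)
    # perplexity must be smaller than the number of points being embedded
    tsne = TSNE(n_components=2, perplexity=perplexity,
                init="random", random_state=seed)
    return tsne.fit_transform(matrix)

def plot_words(coords, words):
    """Scatter the 2-D points and label each one with its word."""
    fig, ax = plt.subplots()
    ax.scatter(coords[:, 0], coords[:, 1])
    for (x, y), word in zip(coords, words):
        ax.annotate(word, (x, y))
    return fig

words = ["bank", "finance", "market", "property",
         "oil", "energy", "business", "economy"]

# Placeholder vectors (assumption): replace with the trained model's vectors,
# e.g. {w: model.wv[w] for w in words}
rng = np.random.RandomState(0)
vectors = {w: rng.rand(20) for w in words}

coords = project_words(vectors, words)
fig = plot_words(coords, words)
# fig.savefig("words.png")  # or plt.show() in an interactive session
```

With so few points the layout depends heavily on the perplexity and the random seed, so distances in the plot are only a loose guide to the similarities reported by the model.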

Upvotes: 5
