Doosan Paik

Reputation: 7

How to apply Word2Vec to data stored in a pandas DataFrame

I am trying to use Word2Vec (W2V). I saved my preprocessed data in a pandas DataFrame, and I want to train a Word2Vec model on that data.

This is my data. http://naver.me/IFjLAHld

This is my code.

from gensim.models.word2vec import Word2Vec
import pandas as pd
import numpy as np

df = pd.read_excel('re_nlp0820.xlsx')

model = Word2Vec(df['nlp'],
                 sg=1,           
                 window=3,       
                 min_count=1,     
                 workers=4,       
                 iter=1)        
model.init_sims(replace=True) 

model_result1 = model.wv.most_similar('국민', topn =20)  
print(model_result1)

Please help me.

Upvotes: 0

Views: 1890

Answers (2)

spectre

Reputation: 767

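First, convert the data you pass to Word2Vec into a nested list in which each inner list holds the tokens of one text. You can do so like this:

from gensim.models.word2vec import Word2Vec
import pandas as pd
import nltk

# nltk.word_tokenize needs NLTK's tokenizer data to be downloaded once beforehand.
df = pd.read_excel('re_nlp0820.xlsx')

# Word2Vec expects a list of token lists, one inner list per text.
nlp = [nltk.word_tokenize(i) for i in df['nlp']]

model = Word2Vec(nlp,
                 sg=1,         # 1 = skip-gram, 0 = CBOW
                 window=3,     # context window size
                 min_count=1,  # keep every word, even those seen only once
                 workers=4,    # number of training threads
                 iter=1)       # training passes; renamed to `epochs` in gensim 4.0
model.init_sims(replace=True)  # L2-normalizes the vectors; deprecated in gensim 4.0

# '국민' is Korean for "the people" / "citizens".
model_result1 = model.wv.most_similar('국민', topn=20)
print(model_result1)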

Upvotes: 3

gojomo

Reputation: 54173

Gensim's Word2Vec needs as its training corpus a re-iterable sequence, where each item is a list-of-words.

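Your df['nlp'] is probably just a sequence of strings, so it's not in the right format. When an item is a plain string, Word2Vec iterates over it character by character, so the model learns vectors for single characters rather than words. You should make sure each item is broken into a Python list that has your desired words as individual strings.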
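A minimal sketch of that conversion, assuming each row of the column already holds space-separated tokens (Korean text may instead need a proper morphological tokenizer). The small DataFrame and the helper name to_token_lists are hypothetical stand-ins for the spreadsheet in the question:

```python
import pandas as pd

# Hypothetical stand-in for the question's spreadsheet: one string per row,
# including a missing value of the kind that often shows up after read_excel.
df = pd.DataFrame({"nlp": ["the people vote", "the people speak", None]})


def to_token_lists(series):
    """Turn a Series of strings into a list of token lists."""
    # Drop missing rows, coerce the rest to str, then split on whitespace.
    return [text.split() for text in series.dropna().astype(str)]


corpus = to_token_lists(df["nlp"])
print(corpus)
```

Because corpus is an ordinary list, it can be iterated more than once, which is what Word2Vec needs for its vocabulary scan and its training passes. It would be passed to Word2Vec in place of df['nlp'].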

(Separately: min_count=1 is almost always a bad idea with this algorithm, which gives better results if rare words with few usage examples are discarded. And, you shouldn't need to call .init_sims() at all.)
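To get a feel for what a given min_count would discard before training, one option is to count token frequencies yourself. This is only an illustrative sketch: the helper vocab_after_min_count and the toy corpus are hypothetical, and it mimics the documented rule (words with total frequency below min_count are ignored) rather than calling gensim:

```python
from collections import Counter


def vocab_after_min_count(corpus, min_count):
    """Return the set of words whose total frequency is at least min_count."""
    counts = Counter(token for sentence in corpus for token in sentence)
    return {word for word, count in counts.items() if count >= min_count}


# Toy corpus in the list-of-token-lists format Word2Vec expects.
corpus = [["the", "people", "vote"], ["the", "people", "speak"]]

kept = vocab_after_min_count(corpus, min_count=2)
dropped = vocab_after_min_count(corpus, min_count=1) - kept
print(sorted(kept), sorted(dropped))
```

On a real corpus, comparing the sizes of kept and dropped for a few thresholds (gensim's default is min_count=5) shows how much of the vocabulary consists of rare words.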

Upvotes: 0
