Reputation: 7
I am trying to use Word2Vec (W2V). I saved my preprocessed data in a pandas DataFrame, and I want to train the Word2Vec algorithm on that preprocessed data.
This is my data. http://naver.me/IFjLAHld
This is my code.
from gensim.models.word2vec import Word2Vec
import pandas as pd
import numpy as np
df = pd.read_excel('re_nlp0820.xlsx')
model = Word2Vec(df['nlp'],
                 sg=1,
                 window=3,
                 min_count=1,
                 workers=4,
                 iter=1)
model.init_sims(replace=True)
model_result1 = model.wv.most_similar('국민', topn =20)
print(model_result1)
Please help me.
Upvotes: 0
Views: 1890
Reputation: 767
First, you need to convert the data you are passing to the Word2Vec instance into a nested list, where each inner list contains the tokenized form of one text. You can do so with:
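from gensim.models.word2vec import Word2Vec
import pandas as pd
import numpy as np
import nltk  # word_tokenize requires NLTK's Punkt tokenizer data to be downloaded first
df = pd.read_excel('re_nlp0820.xlsx')
# Turn each row's string into a list of word tokens
nlp = [nltk.word_tokenize(i) for i in df['nlp']]
model = Word2Vec(nlp,
                 sg=1,         # use skip-gram
                 window=3,
                 min_count=1,
                 workers=4,
                 iter=1)       # this parameter is named `epochs` in Gensim 4.0+
model.init_sims(replace=True)
model_result1 = model.wv.most_similar('국민', topn=20)
print(model_result1)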
Upvotes: 3
Reputation: 54173
Gensim's Word2Vec needs as its training corpus a re-iterable sequence, where each item is a list of words. Your df['nlp'] is probably just a sequence of strings, so it's not in the right format. You should make sure each of its items is broken into a Python list that has your desired words as individual strings.
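To make the format difference concrete, here is a minimal sketch. The `rows` list is a hypothetical stand-in for df['nlp'] (I haven't looked at the actual file), and it assumes the preprocessed text is whitespace-delimited. Iterating over a plain string yields its characters, so a string row would be read as a "sentence" of single-character tokens rather than words:

```python
# Hypothetical stand-in for df['nlp']: each row is one preprocessed string.
rows = ["국민 경제 정책", "국민 안전 대책"]

# Iterating over a string yields characters, so a string row behaves
# like a list of single-character "words" (spaces included).
as_characters = [list(row) for row in rows]

# Splitting on whitespace gives one list of word strings per row.
# (Assumption: the preprocessed text is whitespace-delimited.)
sentences = [row.split() for row in rows]

print(as_characters[0][:3])
print(sentences[0])
```

With a real DataFrame, something like `df['nlp'].astype(str).str.split().tolist()` should produce the same list-of-lists shape, and that list is what would be passed to Word2Vec as the corpus. If the text needs a smarter tokenizer than whitespace splitting, swap that in for `split()`.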
(Separately: min_count=1 is almost always a bad idea with this algorithm, which gives better results if rare words with few usage examples are discarded. And you shouldn't need to call .init_sims() at all.)
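As a rough illustration of what min_count does, the sketch below counts word frequencies over a tiny made-up corpus and keeps only words that occur at least min_count times. This is not Gensim's own code, just a plain-Python imitation of the pruning rule; `surviving_vocab` and the sample sentences are hypothetical.

```python
from collections import Counter

# Tiny made-up corpus in the list-of-token-lists format.
sentences = [
    ["국민", "경제", "정책"],
    ["국민", "안전", "대책"],
    ["경제", "성장"],
]

def surviving_vocab(sentences, min_count):
    """Return the words whose total frequency is at least min_count."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}

vocab_1 = surviving_vocab(sentences, 1)  # every word is kept
vocab_2 = surviving_vocab(sentences, 2)  # words seen only once are dropped

print(len(vocab_1), sorted(vocab_2))
```

Words that appear only once or twice give the model very few context examples to learn from, which is why leaving min_count at a higher value (Gensim's default is 5) tends to work better on a real corpus.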
Upvotes: 0