Reputation: 193
I am trying to generate word vectors using PySpark. Using gensim I can see the words and the closest words as below:
import os
from gensim.models import word2vec

# One tweet per line; tokenise on whitespace
with open(os.path.join(os.getcwd(), "tweets.txt")) as f:
    w2v_input = [line.split() for line in f.read().splitlines()]

model = word2vec.Word2Vec(w2v_input)
# model.wv.vocab is the pre-4.0 gensim API
for key in model.wv.vocab.keys():
    print(key)
    print(model.wv.most_similar(positive=[key]))
Using PySpark:
from pyspark.mllib.feature import Word2Vec
inp = sc.textFile("tweet.txt").map(lambda row: row.split(" "))
word2vec = Word2Vec()
model = word2vec.fit(inp)
How can I list the words in the vector space of model? That is, what is the PySpark equivalent of gensim's model.wv.vocab.keys()?
Background: I need to store the words and their synonyms from the model in a map so I can use them later to find the sentiment of a tweet. I cannot reuse the word-vector model inside PySpark map functions because the model belongs to the Spark context (error pasted below). I want the PySpark Word2Vec version instead of gensim because it gives better synonyms for certain test words.
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers.
Any alternative solution is also welcome.
Upvotes: 4
Views: 7162
Reputation: 338
And as suggested here, if you want to include all the words in your documents, set the minCount parameter accordingly (the default is 5):
word2vec = Word2Vec()
word2vec.setMinCount(1)
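To illustrate what that setting does, here is a rough pure-Python sketch of the filtering effect: tokens that occur fewer than the minimum number of times are dropped from the vocabulary. This is not Spark's implementation; the function name vocabulary and the toy sentences are made up for illustration.

```python
from collections import Counter

def vocabulary(sentences, min_count=5):
    """Return the sorted tokens that occur at least min_count times (hypothetical helper)."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return sorted(word for word, count in counts.items() if count >= min_count)

# Toy corpus: "alpha" occurs 6 times, "beta" 5 times, "charlie" once
sentences = [["alpha", "beta"]] * 5 + [["alpha", "charlie"]]

default_vocab = vocabulary(sentences)            # rare word is dropped
full_vocab = vocabulary(sentences, min_count=1)  # every word is kept
```

With the default threshold, "charlie" would not appear among the model's words, which is why a small corpus can seem to be missing vocabulary.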
Upvotes: 0
Reputation: 60319
The equivalent command in Spark is model.getVectors(), which again returns a dictionary. Here is a quick toy example with only 3 words (alpha, beta, charlie), adapted from the documentation:
sc.version
# u'2.1.1'
from pyspark.mllib.feature import Word2Vec
sentence = "alpha beta " * 100 + "alpha charlie " * 10
localDoc = [sentence, sentence]
doc = sc.parallelize(localDoc).map(lambda line: line.split(" "))
word2vec = Word2Vec()
model = word2vec.fit(doc)
model.getVectors().keys()
# [u'alpha', u'beta', u'charlie']
Regarding finding synonyms, you may find another answer of mine useful.
Regarding the error you mention and a possible workaround, have a look at this answer of mine.
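As one possible way to get the word-to-synonyms map the question asks for, here is a minimal sketch that works only on a plain Python dict of vectors, so it never touches the model or the SparkContext. It assumes the output of model.getVectors() has already been copied into ordinary Python lists; the toy vectors and the helper names cosine and build_synonym_map are hypothetical, and plain cosine similarity is used as the ranking.

```python
import math

# Hypothetical toy vectors standing in for a plain-dict copy of model.getVectors()
vectors = {
    "alpha": [1.0, 0.0],
    "beta": [0.9, 0.1],
    "charlie": [0.0, 1.0],
}

def cosine(u, v):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_synonym_map(vectors, top_n=2):
    """Map each word to its top_n most similar other words."""
    synonyms = {}
    for word, vec in vectors.items():
        scored = [
            (other, cosine(vec, other_vec))
            for other, other_vec in vectors.items()
            if other != word
        ]
        # Highest similarity first
        scored.sort(key=lambda pair: pair[1], reverse=True)
        synonyms[word] = [other for other, _ in scored[:top_n]]
    return synonyms

synonym_map = build_synonym_map(vectors, top_n=1)
```

Because synonym_map is an ordinary dict, it could then be shipped to the workers with sc.broadcast(synonym_map) and looked up inside map functions. This pairwise loop is quadratic in vocabulary size, so it is only a sketch for small vocabularies.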
Upvotes: 5