How to get a dataframe from word2vec model using spark

Question

I am currently working on a sparkling water application and I am a total beginner in spark and h2o.

What I want to do:

loading a input textfile
create a word2vec model
create a dataframe with a column word and a column Vector
using the dataframe as input for h2o

By creating the model i get a map, but i don't know how to create a dataframe of it. The output should look like that:

word | Vector

assert | [0.3, 0.4.....]

sense | [0.6, 0.2.....] and so on.

This is my code so far:

from pyspark import SparkContext
from pyspark.mllib.feature import Word2Vec
from pysparkling import *
import h2o

from pyspark.sql import SQLContext
from pyspark.mllib.linalg import Vectors
from pyspark.sql import Row


# Starting h2o application on spark cluster
hc = H2OContext(sc).start()

# Loading input file
inp = sc.textFile("examples/custom/text8.txt").map(lambda row: row.split(" "))

# building the word2vec model with a vector size of 10
word2vec = Word2Vec()
model = word2vec.setVectorSize(10).fit(inp)

# Sanity check
model.findSynonyms("property",5)

# assign vector representation (map to variable
wordVectorsDF = model.getVectors()

# Transform wordVectorsDF word into dataframe

Is there any approach to that or functions provided by spark?

Thanks in advance

sedioben · Accepted Answer

I found out that there are two libraries for a Word2Vec transformation - I don't know why.

from pyspark.mllib.feature import Word2Vec
from pyspark.ml.feature import Word2Vec

The second line returns a data frame with the function getVectors()and has diffenrent parameters for building a model from the first line.

Maybe somebody can comment on that concerning the 2 different libraries.

Thanks in advance.

How to get a dataframe from word2vec model using spark

Answers (2)

Related Questions