Reputation: 945
I am currently working on a sparkling water application and I am a total beginner in spark and h2o.
What I want to do:
By creating the model i get a map, but i don't know how to create a dataframe of it. The output should look like that:
word | Vector
assert | [0.3, 0.4.....]
sense | [0.6, 0.2.....] and so on.
This is my code so far:
from pyspark import SparkContext
from pyspark.mllib.feature import Word2Vec
from pysparkling import *
import h2o
from pyspark.sql import SQLContext
from pyspark.mllib.linalg import Vectors
from pyspark.sql import Row
# Starting h2o application on spark cluster
hc = H2OContext(sc).start()
# Loading input file
inp = sc.textFile("examples/custom/text8.txt").map(lambda row: row.split(" "))
# building the word2vec model with a vector size of 10
word2vec = Word2Vec()
model = word2vec.setVectorSize(10).fit(inp)
# Sanity check
model.findSynonyms("property",5)
# assign vector representation (map to variable
wordVectorsDF = model.getVectors()
# Transform wordVectorsDF word into dataframe
Is there any approach to that or functions provided by spark?
Thanks in advance
Upvotes: 2
Views: 2948
Reputation: 945
I found out that there are two libraries for a Word2Vec transformation - I don't know why.
from pyspark.mllib.feature import Word2Vec
from pyspark.ml.feature import Word2Vec
The second line returns a data frame with the function getVectors()
and has diffenrent parameters for building a model from the first line.
Maybe somebody can comment on that concerning the 2 different libraries.
Thanks in advance.
Upvotes: 3
Reputation: 15141
First of all in H2O we don't support a Vector
column type, you'd have to make a frame like this:
word | V1 | V2 | ...
assert | 0.3 | 0.4 | ...
sense | 0.6 | 0.2 | ...
Now for the actual question - no, since it's a Scala Map
, we provide ways to create frames from data sources (files on HDFS/S3, databases etc) or conversions from RDDs/DataFrames but not from Java/Scala collections. Writing one would be possible but quite cumbersome.
Not the most performant solution but the easiest code-wise would be to make a DF (or RDD) first (by running sc.parallelize
on map.toSeq
) and then convert it to an H2OFrame:
import hc._
val wordsDF = sc.parallelize(wordVectorsDF.toSeq).toDF
val h2oFrame = asH2OFrame(wordsDF)
Upvotes: -1