Reputation: 67
I am trying to run Spark MLlib's word2vec implementation, using Scala. My input for the model is an array of sequences of strings. It looks as shown below:
scala> f.take(5)
res11: Array[org.apache.spark.sql.Row] = Array([WrappedArray(0_42)], [WrappedArray(big, baller, shoe, ?)], [WrappedArray(since, eliud, win, ,, quick, fact, from, runner, from, country, kalenjins, !, write, ., happy, quick, fact, kalenjins, location, :, kenya, (, kenya's, western, highland, rift, valley, ), population, :, 4, ., 9, million, ;, compose, 11, subtribes, language, :, kalenjin, ;, swahili, ;, english, church, :, christianity, ~, africa, inland, church, [, aic, ],, church, province, kenya, [, cpk, ],, roman, catholic, church, ;, islam, translation, :, kalenjin, translate, ", tell, ", formation, :, wwii, ,, gikuyu, tribal, member, wish, separate, create, identity, ., later, ,, student, attend, alliance, high, school, (, first, british, public, school, kenya, ), form, tribe, become, future, kal...
val v=f.map(l=>Seq(l.toString))
scala> v.take(5)
res31: Array[Seq[String]] = Array(List([WrappedArray(0_42)]), List ([WrappedArray(big, baller, shoe, ?)]), List([WrappedArray(since, eliud, win, ,, quick, fact, from, runner, from, country, kalenjins, !, write, ., happy, quick, fact, kalenjins, location, :, kenya, (, kenya's, western, highland, rift, valley, ), population, :, 4, ., 9, million, ;, compose, 11, subtribes, language, :, kalenjin, ;, swahili, ;, english, church, :, christianity, ~, africa, inland, church, [, aic, ],, church, province, kenya, [, cpk, ],, roman, catholic, church, ;, islam, translation, :, kalenjin, translate, ", tell, ", formation, :, wwii, ,, gikuyu, tribal, member, wish, separate, create, identity, ., later, ,, student, attend, alliance, high, school, (, first, british, public, school, kenya, ), form, ....
Each sentence is in a separate list, as shown above. I run the model with v as the input:
scala> val model = word2vec.fit(v)
But the output of this model does not look right. When I save the model and read its parquet file back (a), I get the results below.
model.save(sc, "myModelPath")
val a=sqlContext.read.parquet("myModelPath")
a.show(20,false)
+--------------------------------------------------------------------+
|word |
+--------------------------------------------------------------------+
|[WrappedArray(coffee, machine)] |
|[WrappedArray(good, experience)] |
|[WrappedArray(love, room, !)] |
|[WrappedArray(parking, .)] |
|[WrappedArray(breakfast, great, !)] |
|[WrappedArray(bed, comfortable, room, spacious, .)] |
Instead of creating a vector for each word, this word2vec model is creating vectors for arrays of words. I am not sure what the correct way of feeding input to this model is, or how it breaks sentences into words.
Upvotes: 0
Views: 208
Reputation: 2155
I'll bet that if you look at v.first you'll see List([WrappedArray(0_42)]), and if you look at v.first.head you'll see [WrappedArray(0_42)]. But v.first.head is a String, and what you're actually seeing is "[WrappedArray(0_42)]". There is no WrappedArray, just a string. Perhaps you accidentally called toString on a WrappedArray (or fell victim to an implicit conversion to String). Word2Vec is actually seeing strings like "[WrappedArray(coffee, machine)]" in its input, and generating a model based on those strings.
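To make the difference concrete, here is a minimal sketch in plain Scala (no Spark needed). The names tokens, wrong and right are made up for illustration, and a List stands in for the WrappedArray, so the printed form starts with List rather than WrappedArray:

```scala
// Hypothetical stand-in for the token field of one row.
val tokens: Seq[String] = Seq("coffee", "machine")

// What Seq(x.toString) does: a one-element Seq holding the printed form.
val wrong: Seq[String] = Seq(tokens.toString)

// What Word2Vec needs: one element per token.
val right: Seq[String] = tokens

println(wrong.length) // 1
println(wrong.head)   // List(coffee, machine)
println(right.length) // 2
```

With the first form, the whole printed sequence is the only "word" the model ever sees for that sentence.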
UPDATE
If I have your types right, f is a DataFrame where each Row contains a single field holding a Seq[String] (which is actually a WrappedArray).
So, instead of
val v=f.map(l=>Seq(l.toString))
what you should be doing to extract that field is
val v = f.map(r => r.getSeq[String](0))
This produces a Dataset[Seq[String]] that should be suitable for input to Word2Vec.
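For reference, a rough end-to-end sketch of the same idea, under some assumptions: Spark 2.x or later, a local SparkSession, and a toy DataFrame standing in for your f (the column name words and the sample sentences are invented here). Since MLlib's Word2Vec.fit takes an RDD, the sketch goes through f.rdd before mapping; setMinCount(1) is only there because the toy data is tiny.

```scala
import org.apache.spark.mllib.feature.Word2Vec
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell one already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("w2v-sketch")
  .getOrCreate()
import spark.implicits._

// Toy stand-in for f: a DataFrame with a single array<string> column.
val f = Seq(
  Seq("coffee", "machine"),
  Seq("good", "experience"),
  Seq("love", "room", "!")
).toDF("words")

// Extract the token sequence from each Row instead of calling toString on it.
val v = f.rdd.map(r => r.getSeq[String](0))

// Small vector size and minCount = 1 only because the sample is tiny.
val model = new Word2Vec().setVectorSize(10).setMinCount(1).fit(v)

// The vocabulary keys are now individual tokens, not printed arrays.
model.getVectors.keys.foreach(println)
```

If the keys printed at the end are single words such as coffee or room, the input was shaped correctly.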
Upvotes: 1