vish

Reputation: 67

Spark MLlib Word2Vec

I am trying to run Spark MLlib's Word2Vec implementation, using Scala. My input to the model is a sequence of strings for each sentence. It looks like this:

scala> f.take(5)
res11: Array[org.apache.spark.sql.Row] = Array([WrappedArray(0_42)], [WrappedArray(big, baller, shoe, ?)], [WrappedArray(since, eliud, win, ,, quick, fact, from, runner, from, country, kalenjins, !, write, ., happy, quick, fact, kalenjins, location, :, kenya, (, kenya's, western, highland, rift, valley, ), population, :, 4, ., 9, million, ;, compose, 11, subtribes, language, :, kalenjin, ;, swahili, ;, english, church, :, christianity, ~, africa, inland, church, [, aic, ],, church, province, kenya, [, cpk, ],, roman, catholic, church, ;, islam, translation, :, kalenjin, translate, ", tell, ", formation, :, wwii, ,, gikuyu, tribal, member, wish, separate, create, identity, ., later, ,, student, attend, alliance, high, school, (, first, british, public, school, kenya, ), form, tribe, become, future, kal...

val v=f.map(l=>Seq(l.toString))
scala> v.take(5)
res31: Array[Seq[String]] = Array(List([WrappedArray(0_42)]), List  ([WrappedArray(big, baller, shoe, ?)]), List([WrappedArray(since, eliud, win, ,, quick, fact, from, runner, from, country, kalenjins, !, write, ., happy, quick, fact, kalenjins, location, :, kenya, (, kenya's, western, highland, rift, valley, ), population, :, 4, ., 9, million, ;, compose, 11, subtribes, language, :, kalenjin, ;, swahili, ;, english, church, :, christianity, ~, africa, inland, church, [, aic, ],, church, province, kenya, [, cpk, ],, roman, catholic, church, ;, islam, translation, :, kalenjin, translate, ", tell, ", formation, :, wwii, ,, gikuyu, tribal, member, wish, separate, create, identity, ., later, ,, student, attend, alliance, high, school, (, first, british, public, school, kenya, ), form, ....

Each sentence is in a separate list, as shown above. I run the model with v as the input:

scala> val model = word2vec.fit(v)

But the output of this model does not look right. When I save the model and read its Parquet file back (as a), I get the results below.

   model.save(sc, "myModelPath")
   val a=sqlContext.read.parquet("myModelPath")
   a.show(20,false)
+--------------------------------------------------------------------+
|word                                                                |
+--------------------------------------------------------------------+
|[WrappedArray(coffee, machine)]                                     |
|[WrappedArray(good, experience)]                                    |
|[WrappedArray(love, room, !)]                                       |
|[WrappedArray(parking, .)]                                          |
|[WrappedArray(breakfast, great, !)]                                 |
|[WrappedArray(bed, comfortable, room, spacious, .)]                 |

Instead of creating a vector for each word, this Word2Vec model is creating vectors for whole arrays of words. I am not sure what the correct way to feed input to this model is, or how it splits sentences into words.

Upvotes: 0

Views: 208

Answers (1)

Joe Pallas

Reputation: 2155

I'll bet that if you look at v.first you'll see List([WrappedArray(0_42)]) and if you look at v.first.head you'll see [WrappedArray(0_42)]. But v.first.head is a String, and what you're actually seeing is "[WrappedArray(0_42)]". There is no WrappedArray, just a string. Perhaps you accidentally called toString on a WrappedArray (or fell victim to an implicit conversion to String). Word2Vec is actually seeing strings like "[WrappedArray(coffee, machine)]" in its input, and generating a model based on those strings.
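To make the difference concrete, here is a minimal plain-Scala sketch with no Spark involved. The object name, the helper names, and the sample sentence are hypothetical; they only illustrate how wrapping toString collapses a tokenized sentence into a single token.

```scala
object TokenShapeDemo {
  // Hypothetical stand-in for the single field of one row: an already tokenized sentence.
  val sentence: Seq[String] = Seq("coffee", "machine")

  // Mirrors the shape of the question's map: stringify the whole container, then wrap it.
  def wrapToString(words: Seq[String]): Seq[String] = Seq(words.toString)

  // What a word-level model needs: the tokens themselves, one element per word.
  def keepTokens(words: Seq[String]): Seq[String] = words

  def main(args: Array[String]): Unit = {
    println(wrapToString(sentence).size) // 1: the whole sentence became one "word"
    println(keepTokens(sentence).size)   // 2: one element per token
  }
}
```

With the first form, every sentence looks like a one-word sentence to the model, which matches the vocabulary shown in the saved Parquet file.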

UPDATE

If I have your types right, f is a DataFrame where each Row contains a single field holding a Seq[String] (which is actually a WrappedArray).

So, instead of

val v=f.map(l=>Seq(l.toString))

what you should be doing to extract that field is

val v = f.map(r => r.getSeq[String](0))

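On Spark 2.x this produces a Dataset[Seq[String]] (on Spark 1.x, an RDD[Seq[String]]) whose elements are the individual tokens, which is the shape Word2Vec needs. Note that the mllib Word2Vec.fit method takes an RDD, so if you end up with a Dataset, map over f.rdd instead.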
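An end-to-end sketch of that extraction might look like the following. It assumes Spark 2.x or later with the mllib API on the classpath; the object name, the toy DataFrame standing in for f, and the setMinCount(1) and setVectorSize(10) settings are my own choices for illustration, not taken from the question. It goes through f.rdd because mllib's fit expects an RDD of token sequences.

```scala
import org.apache.spark.mllib.feature.Word2Vec
import org.apache.spark.sql.SparkSession

object Word2VecInputSketch {
  // Trains on a toy corpus and returns the learned vocabulary.
  def vocabulary(spark: SparkSession): Set[String] = {
    import spark.implicits._

    // Hypothetical stand-in for f: each row holds a single array<string> field.
    val f = Seq(
      Seq("coffee", "machine"),
      Seq("good", "experience"),
      Seq("breakfast", "great", "!")
    ).toDF("words")

    // Extract the field itself rather than calling toString on the Row.
    val v = f.rdd.map(r => r.getSeq[String](0))

    // minCount is lowered to 1 only because this corpus is tiny (the default is 5).
    val model = new Word2Vec().setMinCount(1).setVectorSize(10).fit(v)

    // Keys of the vector map are the vocabulary entries.
    model.getVectors.keySet
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("w2v-sketch").getOrCreate()
    vocabulary(spark).foreach(println)
    spark.stop()
  }
}
```

If the input is shaped correctly, the printed vocabulary should consist of single tokens such as coffee and machine rather than strings like [WrappedArray(coffee, machine)].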

Upvotes: 1
