Burrito

Reputation: 1634

How does Spark Word2Vec merge each partition's results?

Increasing numPartitions for Spark's Word2Vec makes training faster but less accurate, since each partition is fit separately (which reduces the context available for each word) and the per-partition results are then merged.

How exactly does it merge the results from multiple partitions? Is it just an average of the vectors? Looking to better understand how this affects the accuracy.
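For reference, the knob I'm talking about is setNumPartitions on the RDD-based spark.mllib Word2Vec. A minimal sketch of the setup (sc and the data path are just placeholders):

import org.apache.spark.mllib.feature.Word2Vec
import org.apache.spark.rdd.RDD

// Tokenized corpus; sc is an existing SparkContext and the path is a placeholder
val corpus: RDD[Seq[String]] = sc.textFile("data/corpus.txt").map(_.split(" ").toSeq)

val model = new Word2Vec()
  .setVectorSize(100)
  .setMinCount(5)
  .setNumPartitions(8) // more partitions: faster training, but each partition is fit separately
  .fit(corpus)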

Looking at the source code, I think the merging is happening here:

val synAgg = partial.reduceByKey { case (v1, v2) =>
  // saxpy with alpha = 1.0f is an in-place element-wise sum: v1 += v2
  blas.saxpy(vectorSize, 1.0f, v2, 1, v1, 1)
  v1
}.collect()

That looks like just an element-wise vector sum (effectively an average across partitions; see the toy sketch at the end of the question). partial comes from:

val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
  // Each sentence will map to 0 or more Array[Int]
  sentenceIter.flatMap { sentence =>
    // Sentence of words, some of which map to a word index
    val wordIndexes = sentence.flatMap(bcVocabHash.value.get)
    // break wordIndexes into chunks of maxSentenceLength when longer
    wordIndexes.grouped(maxSentenceLength).map(_.toArray)
  }
}

val newSentences = sentences.repartition(numPartitions).cache()

val partial = newSentences.mapPartitionsWithIndex { case (idx, iter) =>
  // ... long calculation (skip-gram training, etc.)
}
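
As a side note, the grouped(maxSentenceLength) call above just splits long sentences of word indexes into fixed-size chunks, e.g. with toy values:

val maxSentenceLength = 3
val wordIndexes = Seq(7, 2, 9, 4, 1, 8, 5)
wordIndexes.grouped(maxSentenceLength).map(_.toArray).toList
// List(Array(7, 2, 9), Array(4, 1, 8), Array(5))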

But I'm not a Word2Vec/Spark ML/Scala expert, so I'm hoping someone more knowledgeable can verify.
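
For concreteness, here is a tiny Spark-free toy of what I think that reduceByKey amounts to (made-up word indexes and vectors, plain arrays in place of BLAS):

// (wordIndex, vector) pairs as they might come out of two partitions (toy data)
val partial = Seq(
  (0, Array(1.0f, 2.0f)), // word 0, trained in partition A
  (0, Array(3.0f, 4.0f)), // word 0, trained in partition B
  (1, Array(5.0f, 6.0f))  // word 1, only seen in partition A
)

// the same element-wise sum that saxpy(vectorSize, 1.0f, v2, 1, v1, 1) does in place
val synAgg = partial
  .groupBy(_._1)
  .map { case (id, vecs) =>
    id -> vecs.map(_._2).reduce((v1, v2) => v1.zip(v2).map { case (a, b) => a + b })
  }
// synAgg(0): Array(4.0, 6.0), synAgg(1): Array(5.0, 6.0)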

Upvotes: 2

Views: 189

Answers (1)

philipmathieu

Reputation: 13

saxpy is a routine from BLAS (a widely used linear algebra library) that computes "a scalar times a vector plus a vector", i.e. y := alpha*x + y. In this case the scalar alpha is 1.0f, so the call simply sums the two vectors in place. The increment arguments (the 1s) tell the routine that the elements are contiguous in memory (stride of 1), which lets it compute the result more efficiently. In more recent versions of Spark, an additional normalization term is used to prevent overflows (see here).
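To illustrate the semantics, here is a plain-Scala sketch of what saxpy does for contiguous arrays (increments of 1); this is just an illustration, not the actual BLAS implementation:

// y := alpha * x + y, updating y in place (increments of 1 assumed)
def saxpy(n: Int, alpha: Float, x: Array[Float], y: Array[Float]): Unit = {
  var i = 0
  while (i < n) {
    y(i) += alpha * x(i)
    i += 1
  }
}

val v1 = Array(1.0f, 2.0f, 3.0f)
val v2 = Array(10.0f, 20.0f, 30.0f)
saxpy(3, 1.0f, v2, v1) // alpha = 1.0f, so this is just v1 += v2 element-wise
// v1 is now Array(11.0f, 22.0f, 33.0f)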

You are correct that this is effectively an average of the vectors. If you think of Word2Vec as a neural network, this is roughly like averaging updates over one very large batch, with the effective batch size being the number of rows in each partition of data. Since that is a very large number, it could prevent you from reaching the absolute optimal result (that is, the set of "perfect" embeddings that minimize the Word2Vec loss function), but this may or may not be a real issue depending on your application and dataset.
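
As a rough sketch of the averaging idea (an illustration only, not Spark's actual code; the counts here are a stand-in for how many partitions contributed to each word):

// summed per-word vectors after the reduce, plus made-up contribution counts
val summed = Map(0 -> Array(4.0f, 6.0f), 1 -> Array(5.0f, 6.0f))
val counts = Map(0 -> 2, 1 -> 1)

// scale each vector by 1 / count to turn the sum into an average
val averaged = summed.map { case (id, vec) =>
  id -> vec.map(_ / counts(id))
}
// averaged(0): Array(2.0, 3.0), averaged(1): Array(5.0, 6.0)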

Upvotes: 0
