hari

Reputation: 251

Interpretation of Spark MLLib LDA results

I ran LDA on Spark for a set of documents and observed that the values of topicsMatrix, which represents the topic distribution over terms, are greater than 1, like 548.2201, 685.2436, 138.4013... What do these values mean? Are they the logarithmic values of the distribution, or something similar? How do I convert these values into probability distribution values? Thanks in advance.

Upvotes: 4

Views: 1121

Answers (2)

Jason Scott Lenderman

Reputation: 1918

In both models (i.e. DistributedLDAModel and LocalLDAModel) the topicsMatrix method will, I believe, return (approximately, up to a bit of regularization due to the Dirichlet prior on topics) the expected word-topic count matrix. To check this, sum the entries of each column of that matrix: the resulting vector (one entry per topic) holds the expected number of tokens assigned to each topic, and those totals taken together should approximately equal the total word count over all your documents. In any case, to obtain the topics (probability distributions over the words in your dictionary) you need to normalize the columns of the matrix returned by topicsMatrix so that each column sums to 1.

I haven't tested it out fully, but something like this should work to normalize the columns of the matrix returned by topicsMatrix:

import breeze.linalg.{sum, DenseMatrix => BDM}
import org.apache.spark.mllib.linalg.{DenseMatrix, Matrix}

def normalizeColumns(m: Matrix): DenseMatrix = {
  // Matrices.toBreeze is private to Spark, so go through toArray instead;
  // Matrix.toArray returns the entries in column-major order, which is
  // also Breeze's storage layout, so this copy preserves the shape.
  val bm = new BDM[Double](m.numRows, m.numCols, m.toArray)
  var j = bm.cols
  while (j > 0) {
    j -= 1
    bm(::, j) /= sum(bm(::, j)) // divide each column by its sum
  }
  new DenseMatrix(bm.rows, bm.cols, bm.data)
}
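The same normalization can be seen without any Spark or Breeze dependency by working directly on a column-major array, which is how topicsMatrix stores its data. This is a minimal sketch; the 3×2 values below are made up for illustration:

```scala
// Spark-free sketch of column normalization on a column-major array.
// 3 words x 2 topics; all values are invented for illustration.
val rows = 3
val cols = 2
// column-major layout: first 3 entries are column 0, next 3 are column 1
val data = Array(548.0, 685.0, 138.0, 10.0, 20.0, 70.0)

val normalized = data.clone()
for (j <- 0 until cols) {
  val colSum = (0 until rows).map(i => data(j * rows + i)).sum
  for (i <- 0 until rows) normalized(j * rows + i) /= colSum
}

// each column of `normalized` now sums to 1, i.e. each topic is a
// probability distribution over the three words
```

After this step the large raw counts from the question (548.2201, 685.2436, ...) become per-topic word probabilities.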

Upvotes: 4

petertc

Reputation: 3911

Normalize the columns of the matrix returned by topicsMatrix in pure Scala:

def formatSparkLDAWordOutput(wordTopMat: Matrix, wordMap: Map[Int, String]): scala.Predef.Map[String, Array[Double]] = {

  // The incoming word-topic matrix is in column-major order and its columns are unnormalized.
  val m = wordTopMat.numRows
  val n = wordTopMat.numCols
  val columnSums: Array[Double] =
    Range(0, n).map(j => Range(0, m).map(i => wordTopMat(i, j)).sum).toArray

  // Transposing makes each word's row contiguous, so grouped(n) yields one
  // length-n array of unnormalized per-topic values per word.
  val wordProbs: Seq[Array[Double]] = wordTopMat.transpose.toArray.grouped(n).toSeq
    .map(unnormProbs => unnormProbs.zipWithIndex.map({ case (u, j) => u / columnSums(j) }))

  wordProbs.zipWithIndex.map({ case (topicProbs, wordInd) => (wordMap(wordInd), topicProbs) }).toMap
}

https://github.com/apache/incubator-spot/blob/v1.0-incubating/spot-ml/src/main/scala/org/apache/spot/lda/SpotLDAWrapper.scala#L237
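The reshaping this function performs can be sketched without Spark by starting from a plain column-major array and a word map. The 3-word × 2-topic counts and the word names below are invented for illustration; the indexing mirrors what formatSparkLDAWordOutput computes:

```scala
// Spark-free sketch: turn a column-major word-topic count matrix into a
// word -> per-topic-probabilities map. All values are made up.
val m = 3 // words
val n = 2 // topics
val counts = Array(2.0, 3.0, 5.0, 1.0, 1.0, 2.0) // column-major
val wordMap = Map(0 -> "apple", 1 -> "banana", 2 -> "cherry")

// sum of each column = total count assigned to each topic
val columnSums = (0 until n).map(j => (0 until m).map(i => counts(j * m + i)).sum)

// for word i, entry j is counts(i, j) / columnSums(j)
val wordProbs: Map[String, Array[Double]] =
  (0 until m).map { i =>
    wordMap(i) -> (0 until n).map(j => counts(j * m + i) / columnSums(j)).toArray
  }.toMap

// wordProbs("apple") is Array(0.2, 0.25): probability 0.2 in topic 0
// (2 out of 10) and 0.25 in topic 1 (1 out of 4)
```

Each topic's probabilities over all words sum to 1, which is the normalization the question was missing.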

Upvotes: 0
