Reputation: 251
I ran LDA on Spark for a set of documents and observed that the values of topicsMatrix, which represents the topic distribution over terms, are greater than 1, e.g. 548.2201, 685.2436, 138.4013... What do these values mean? Are they logarithmic values of the distribution, or something else? How do I convert these values to probability distribution values? Thanks in advance.
Upvotes: 4
Views: 1121
Reputation: 1918
In both models (i.e. DistributedLDAModel and LocalLDAModel) the topicsMatrix method will, I believe, return (approximately; there's a bit of regularization due to the Dirichlet prior on topics) the expected word-topic count matrix. To check this, take that matrix and sum up each of its columns. The resulting vector (of length equal to the number of topics) should sum, in total, to approximately the total word count over all your documents. In any case, to obtain the topics (probability distributions over the words in your dictionary), you need to normalize the columns of the matrix returned by topicsMatrix so that each sums to 1.
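To illustrate what that normalization does, here is a sketch on a toy word-topic count matrix in plain Scala (made-up counts, stored column-major like Spark's Matrix; no Spark required):

```scala
// Toy counts: 3 words x 2 topics, column-major like Spark's Matrix.
val rows = 3
val counts = Array(2.0, 2.0, 4.0, // topic 0 column
                   1.0, 1.0, 2.0) // topic 1 column

// Divide each column by its sum to get a distribution over words per topic.
val probs = counts.grouped(rows).flatMap { col =>
  val s = col.sum
  col.map(_ / s)
}.toArray

// Every column now sums to 1.
println(probs.grouped(rows).map(_.sum).toList) // List(1.0, 1.0)
```

The same column-wise division is what the Breeze-based function below performs on the real topicsMatrix.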
I haven't tested it out fully, but something like this should work to normalize the columns of the matrix returned by topicsMatrix:
import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV}
import org.apache.spark.mllib.linalg.{DenseMatrix, Matrix}

def normalizeColumns(m: Matrix): DenseMatrix = {
  // Matrix.toArray is column-major, which is also Breeze's layout, so this
  // copy preserves the matrix (Matrices.toBreeze is private to Spark).
  val bm = new BDM[Double](m.numRows, m.numCols, m.toArray)
  // Accumulate each column's sum by adding up the row slices.
  val columnSums = BDV.zeros[Double](bm.cols)
  var i = bm.rows
  while (i > 0) { i -= 1; columnSums += bm(i, ::).t }
  // Divide each column in place by its sum so it becomes a distribution.
  i = bm.cols
  while (i > 0) { i -= 1; bm(::, i) /= columnSums(i) }
  new DenseMatrix(bm.rows, bm.cols, bm.data)
}
Upvotes: 4
Reputation: 3911
Here is how to normalize the columns of the matrix returned by topicsMatrix in pure Scala:
def formatSparkLDAWordOutput(wordTopMat: Matrix, wordMap: Map[Int, String]): Map[String, Array[Double]] = {
  // The incoming word-topic matrix is in column-major order and its columns are unnormalized.
  val m = wordTopMat.numRows // number of words
  val n = wordTopMat.numCols // number of topics
  // Sum of each topic column.
  val columnSums: Array[Double] = Range(0, n).map(j => Range(0, m).map(i => wordTopMat(i, j)).sum).toArray
  // Transposing makes the data row-major, so grouping by n yields one array of topic weights per word.
  val wordProbs: Seq[Array[Double]] = wordTopMat.transpose.toArray.grouped(n).toSeq
    .map(unnormProbs => unnormProbs.zipWithIndex.map { case (u, j) => u / columnSums(j) })
  wordProbs.zipWithIndex.map { case (topicProbs, wordInd) => (wordMap(wordInd), topicProbs) }.toMap
}
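As a sanity check of the same indexing and normalization logic without Spark, the column sums and per-word rows can be sketched on a plain column-major array (toy values, hypothetical names):

```scala
// 2 words x 2 topics, column-major: column j holds topic j's raw weights.
val m = 2
val n = 2
val wordTopArr = Array(1.0, 3.0, // topic 0 column
                       2.0, 2.0) // topic 1 column
val colSums = wordTopArr.grouped(m).map(_.sum).toArray

// Word i's normalized topic weights: entry (i, j) lives at index j * m + i.
val wordProbs = (0 until m).map { i =>
  (0 until n).map(j => wordTopArr(j * m + i) / colSums(j)).toArray
}
println(wordProbs.map(_.toList).toList) // List(List(0.25, 0.5), List(0.75, 0.5))
```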
Upvotes: 0