Reputation: 251
I ran LDA on Spark for a set of documents and observed that the values of topicsMatrix, which represents the topic distribution over terms, are greater than 1, e.g. 548.2201, 685.2436, 138.4013... What do these values mean? Are they logarithmic values of the distribution, or something else? How do I convert these values to probability distribution values? Thanks in advance.
Upvotes: 4
Views: 1121
Reputation: 1918
In both models (i.e. DistributedLDAModel and LocalLDAModel) the topicsMatrix method will, I believe, return (approximately; there's a bit of regularization due to the Dirichlet prior on topics) the expected word-topic count matrix. To check this, take that matrix and sum up each of its columns. The resulting vector (of length equal to the number of topics) should sum, in total, to approximately the total word count over all your documents. In any case, to obtain the topics (probability distributions over the words in your dictionary), you need to normalize the columns of the matrix returned by topicsMatrix so that each sums to 1.
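To illustrate what that normalization does, here is a sketch on a toy word-topic count matrix in plain Scala (made-up counts, stored column-major like Spark's Matrix; no Spark required):

```scala
// Toy counts: 3 words x 2 topics, column-major like Spark's Matrix.
val rows = 3
val counts = Array(2.0, 2.0, 4.0, // topic 0 column
                   1.0, 1.0, 2.0) // topic 1 column

// Divide each column by its sum to get a distribution over words per topic.
val probs = counts.grouped(rows).flatMap { col =>
  val s = col.sum
  col.map(_ / s)
}.toArray

// Every column now sums to 1.
println(probs.grouped(rows).map(_.sum).toList) // List(1.0, 1.0)
```

The same column-wise division is what the Breeze-based function below performs on the real topicsMatrix.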
I haven't tested it out fully, but something like this should work to normalize the columns of the matrix returned by topicsMatrix:
import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV}
import org.apache.spark.mllib.linalg.{DenseMatrix, Matrix}

def normalizeColumns(m: Matrix): DenseMatrix = {
  // Matrix.toArray is column-major, which is also Breeze's layout, so this
  // copy preserves the matrix (Matrices.toBreeze is private to Spark).
  val bm = new BDM[Double](m.numRows, m.numCols, m.toArray)
  // Accumulate each column's sum by adding up the row slices.
  val columnSums = BDV.zeros[Double](bm.cols)
  var i = bm.rows
  while (i > 0) { i -= 1; columnSums += bm(i, ::).t }
  // Divide each column in place by its sum so it becomes a distribution.
  i = bm.cols
  while (i > 0) { i -= 1; bm(::, i) /= columnSums(i) }
  new DenseMatrix(bm.rows, bm.cols, bm.data)
}
Upvotes: 4
Reputation: 3911
Here is how to normalize the columns of the matrix returned by topicsMatrix in pure Scala:
def formatSparkLDAWordOutput(wordTopMat: Matrix, wordMap: Map[Int, String]): Map[String, Array[Double]] = {
  // The incoming word-topic matrix is in column-major order and its columns are unnormalized.
  val m = wordTopMat.numRows // number of words
  val n = wordTopMat.numCols // number of topics
  // Sum of each topic column.
  val columnSums: Array[Double] = Range(0, n).map(j => Range(0, m).map(i => wordTopMat(i, j)).sum).toArray
  // Transposing makes the data row-major, so grouping by n yields one array of topic weights per word.
  val wordProbs: Seq[Array[Double]] = wordTopMat.transpose.toArray.grouped(n).toSeq
    .map(unnormProbs => unnormProbs.zipWithIndex.map { case (u, j) => u / columnSums(j) })
  wordProbs.zipWithIndex.map { case (topicProbs, wordInd) => (wordMap(wordInd), topicProbs) }.toMap
}
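As a sanity check of the same indexing and normalization logic without Spark, the column sums and per-word rows can be sketched on a plain column-major array (toy values, hypothetical names):

```scala
// 2 words x 2 topics, column-major: column j holds topic j's raw weights.
val m = 2
val n = 2
val wordTopArr = Array(1.0, 3.0, // topic 0 column
                       2.0, 2.0) // topic 1 column
val colSums = wordTopArr.grouped(m).map(_.sum).toArray

// Word i's normalized topic weights: entry (i, j) lives at index j * m + i.
val wordProbs = (0 until m).map { i =>
  (0 until n).map(j => wordTopArr(j * m + i) / colSums(j)).toArray
}
println(wordProbs.map(_.toList).toList) // List(List(0.25, 0.5), List(0.75, 0.5))
```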
Upvotes: 0