Reputation: 69
I have a news corpus, and I want to use LDA to extract keywords for each news document. The keywords can also be called labels, indicating what the news is about. Rather than using tf-idf, I searched the Internet and came to think that LDA can do this job better.
Referring to the Spark docs (https://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda), I found that EMLDAOptimizer produces a DistributedLDAModel, which stores not only the inferred topics but also the full training corpus and topic distributions for each document in the training corpus; a DistributedLDAModel supports methods such as topTopicsPerDocument and topDocumentsPerTopic. OnlineLDAOptimizer, on the other hand, produces a LocalLDAModel, which only stores the inferred topics. (This is from mllib.)
My news corpus is pure text, roughly 1.2 GB in size. After tokenizing, removing stopwords, and the other data-cleaning (pre-processing) steps, I use CountVectorizer with VocabSize set to 200000 and LDA with K set to 15, MaxIter set to 100, Optimizer set to "em", and CheckpointInterval set to 10; parameters not mentioned keep their default values. These two stages are put in a Pipeline for training. (This is from ml.)
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.feature.CountVectorizer

// Turn the tokenized text into term-count vectors
val countVectorizer = new CountVectorizer()
  .setInputCol("content_clean_cut")
  .setOutputCol("count_vector")
  .setVocabSize(200000)

// Fit LDA with 15 topics using the EM optimizer
val lda = new LDA()
  .setK(15)
  .setMaxIter(100)
  .setFeaturesCol("count_vector")
  .setOptimizer("em")
  .setCheckpointInterval(10)

val pipeline = new Pipeline()
  .setStages(Array(countVectorizer, lda))

val ldaModel = pipeline.fit(newsDF)
ldaModel.write.overwrite().save("./news_lda.model")
I sent the job to spark-submit with about 300 GB of memory, and it finally trained successfully. Then I used this pipeline model to transform the pre-processed news corpus; the output of show() is:
+------------------------------+----+--------------------+--------------------+
| content_clean_cut| cls| count_vector| topicDistribution|
+------------------------------+----+--------------------+--------------------+
| [深锐, 观察, 科比, 只想, ...|体育|(200000,[4,6,9,11...|[0.02062984049807...|
| [首届, 银联, 网络, 围棋赛,...|体育|(200000,[2,4,7,9,...|[0.02003532045153...|
|[董希源, 国米, 必除, 害群之...|体育|(200000,[2,4,9,11...|[0.00729266918401...|
| [李晓霞, 破茧, 成蝶, 只差,...|体育|(200000,[2,4,7,13...|[0.01200369382233...|
| [深锐, 观察, 对手, 永远, ...|体育|(200000,[4,9,13,1...|[0.00613485655279...|
+------------------------------+----+--------------------+--------------------+
schema:
root
|-- content_clean_cut: array (nullable = true)
| |-- element: string (containsNull = true)
|-- cls: string (nullable = true)
|-- count_vector: vector (nullable = true)
|-- topicDistribution: vector (nullable = true)
I don't understand what this topicDistribution column means. Why is its length K? Does it mean that the index of the largest value is the topic index of the document (news item), so we can infer a document's topic by finding the index of the largest value, and that this index matches the topic index in the result returned by describeTopics()?
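To make the question concrete, here is a minimal sketch of the reading I am assuming (transformed is a stand-in name for the DataFrame returned by ldaModel.transform):

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// topicDistribution is a K-length probability vector; its argmax would be
// the index of the most probable topic for that document
val dominantTopic = udf { v: Vector => v.argmax }
val withTopic = transformed.withColumn("dominant_topic", dominantTopic(col("topicDistribution")))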
I cast the second stage in the pipeline to DistributedLDAModel but failed to find anything related to topTopicsPerDocument and topDocumentsPerTopic. Why is this different from the official documents?
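For reference, here is roughly how I did the cast (a sketch; ldaModel is the fitted pipeline from above):

import org.apache.spark.ml.clustering.DistributedLDAModel

// the second pipeline stage is the fitted LDA model; with optimizer "em"
// the cast to DistributedLDAModel succeeds
val distLDA = ldaModel.stages(1).asInstanceOf[DistributedLDAModel]

// methods I do find here: describeTopics, topicsMatrix, logLikelihood, ...
distLDA.describeTopics(10).show(false)

As far as I can tell, topTopicsPerDocument and topDocumentsPerTopic are documented for the spark.mllib DistributedLDAModel, not for this spark.ml wrapper.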
And there is this topicsMatrix method on the DistributedLDAModel instance; what exactly is it? I have done some research and think topicsMatrix relates every topic to every term in countVectorizerModel.vocabulary, but I don't see how this topicsMatrix would help. Besides, some of the numbers in this matrix are doubles greater than 1, which confuses me. But this is not important.
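Here is a sketch of how I would read that matrix if my guess is right (reusing distLDA from above; with the EM optimizer the entries appear to be unnormalized topic-term weights, which would explain values greater than 1):

// topicsMatrix has vocabSize rows and k columns; normalizing each column
// would give an estimated term distribution for that topic
val tm = distLDA.topicsMatrix
val colSums = Array.tabulate(tm.numCols) { j =>
  (0 until tm.numRows).map(i => tm(i, j)).sum
}
// estimated probability of term i under topic j
def termProb(i: Int, j: Int): Double = tm(i, j) / colSums(j)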
What is more important is how to use LDA to extract different keywords for each document (news item). Keyword extraction is the automatic identification of terms that best describe the subject of a document.
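To show what I am after, here is one approach I have sketched out (reusing withTopic and distLDA from above; whether this is the right way is exactly my question): take each document's dominant topic and use that topic's top terms, looked up in the CountVectorizerModel vocabulary, as the document's keywords.

import org.apache.spark.ml.feature.CountVectorizerModel
import org.apache.spark.sql.functions.{col, udf}

val cvModel = ldaModel.stages(0).asInstanceOf[CountVectorizerModel]
val vocab = cvModel.vocabulary

// top-10 term indices per topic, resolved to the actual words
val topicTerms: Array[Array[String]] = distLDA.describeTopics(10)
  .select("topic", "termIndices")
  .collect()
  .sortBy(_.getInt(0))
  .map(_.getAs[Seq[Int]](1).map(i => vocab(i)).toArray)

// attach the dominant topic's top terms to each document as "keywords"
val topicTermsUdf = udf { topic: Int => topicTerms(topic) }
val keywords = withTopic.withColumn("keywords", topicTermsUdf(col("dominant_topic")))

But this gives the same keywords to every document that shares a dominant topic, which is not really per-document extraction.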
Upvotes: 0
Views: 883
Reputation: 11
K is the number of topics into which your news corpus is clustered. The topicDistribution of each document is an array of K topic probabilities (basically telling you which topic index has the highest probability). You would then be required to manually label the K topics (based on the terms grouped under each topic), and in that way you are able to "label" the documents.
LDA is not going to give you a "label" based on the text; instead, it clusters the related keywords into the desired k topics.
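As a sketch of that manual step (the indices and labels below are made up; you choose them yourself after inspecting the top terms of each topic from describeTopics):

// hand-written mapping from topic index to a human-chosen label
val topicLabels = Map(
  0 -> "sports",
  1 -> "finance"
  // ... one entry per topic index, up to K - 1
)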
Upvotes: 1