Reputation: 25
I would like to know how to train MALLET LDA on sentences from the 130 .txt files (monthly data) in my corpus. The problem I face when estimating at the document level is that the plot of topic proportions over time looks odd: as time passes, the proportions barely vary, and for some topics the proportion does not change at all.
Here is the script I use.
dir <- "C:/Users/Dell/desktop/MPSCLEANED"
setwd(dir)
require(mallet)
documents <- mallet.read.dir(dir)
mallet.instances <- mallet.import(documents$id, documents$text,
                                  "C:/Users/Dell/desktop/stopwords.txt",
                                  token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")
# Before moving on, I just wonder how I can estimate LDA on sentences from
# all the documents in my corpus.
n.topics <- 15
topic.model <- MalletLDA(n.topics, alpha.sum=3.33, beta=0.2)
topic.model$model$setRandomSeed(19820L)
topic.model$setOptimizeInterval(50L)
topic.model$loadDocuments(mallet.instances)
vocabulary <- topic.model$getVocabulary()
word.freqs <- mallet.word.freqs(topic.model)
topic.model$setAlphaOptimization(1,1000)
topic.model$train(1000)
topic.model$maximize(20)
doc.topics <- mallet.doc.topics(topic.model, smoothed=T, normalized=T)
topic.words <- mallet.topic.words(topic.model, smoothed=T, normalized=T)
topic.docs <- t(doc.topics)
topic.docs <- topic.docs / rowSums(topic.docs)
topics.labels <- rep("", n.topics)
for (topic in 1:n.topics) {
  topics.labels[topic] <- paste(mallet.top.words(topic.model,
      topic.words[topic, ], num.top.words = 5)$words, collapse = " ")
}
topics.labels
#Topics over time
# Note: java.parameters only takes effect if set before rJava/mallet is first loaded
options(java.parameters="-Xmx2g")
library("dfrtopics")
library("dplyr")
library("ggplot2")
library("lubridate")
library("stringr")
library("mallet")
m <- mallet_model(doc_topics = doc.topics, doc_ids = documents$id,
                  vocab = vocabulary, topic_words = topic.words,
                  model = topic.model)
pd <- data.frame(date = gsub("\\.txt$", "",
                             list.files(path = "C:/Users/Dell/Desktop/MPS")))
meta <- data.frame(id = documents$id, pubdate = as.Date(pd$date, "%Y%m%d"))
metadata(m) <- meta
# Visualize topics over time
theme_update(strip.text=element_text(size=7),
axis.text=element_text(size=7))
topic_series(m) %>%
plot_series(labels=topic_labels(m, 2))
Upvotes: 1
Views: 712
Reputation: 1901
130 documents isn't very much for estimating a topic model. Can the documents be subdivided into smaller segments?
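One way to subdivide, staying within the mallet package: split each file's text into sentences before `mallet.import`, so every sentence becomes its own "document". This is a minimal sketch, not a definitive implementation — the sentence-splitting regex (break after `.`, `!`, or `?` followed by whitespace) is a rough heuristic, and the paths and stopword file are the ones from the question.

```r
library(mallet)

# Read the original 130 files (data frame with columns id and text)
docs <- mallet.read.dir("C:/Users/Dell/desktop/MPSCLEANED")

# Split each file into sentences; give each sentence an id derived
# from its source file so results can be aggregated back per month
sentence.docs <- do.call(rbind, lapply(seq_len(nrow(docs)), function(i) {
  sents <- unlist(strsplit(docs$text[i], "(?<=[.!?])\\s+", perl = TRUE))
  sents <- trimws(sents)
  sents <- sents[nchar(sents) > 0]
  data.frame(id = paste(docs$id[i], seq_along(sents), sep = "_"),
             text = sents, stringsAsFactors = FALSE)
}))

# Import the sentence-level documents instead of the file-level ones
mallet.instances <- mallet.import(sentence.docs$id, sentence.docs$text,
                                  "C:/Users/Dell/desktop/stopwords.txt",
                                  token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")
```

After training, the sentence-level rows of `mallet.doc.topics()` can be averaged per source file (the part of the id before the last `_`) to get one topic distribution per month for the time-series plot.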
Upvotes: 1