Benz M.

Reputation: 25

Topics proportion over time using Mallet LDA

I would like to know how to train MALLET LDA on sentences from the 130 .txt files (monthly data) in my corpus. The problem I face when estimating at the document level is that the plot of topic proportions over time looks strange: as time passes, the proportions barely vary, and for some topics the proportion does not change at all.


Here is the script I use.

dir <- "C:/Users/Dell/desktop/MPSCLEANED"
setwd(dir)
require(mallet)

documents <- mallet.read.dir(dir) 
mallet.instances <- mallet.import(documents$id, documents$text,
    "C:/Users/Dell/desktop/stopwords.txt",
    token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")

# Before moving on, I just wonder how I can estimate LDA by sentences
# from all documents in my corpus.
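To estimate at the sentence level, one option is to split each document's text into sentences and give each sentence its own id before calling mallet.import, so that every sentence becomes its own "document". A minimal sketch (the split_sentences helper and the `(?<=[.!?])\s+` split rule are illustrative assumptions, not part of the mallet API; a dedicated tokenizer such as tokenizers::tokenize_sentences would be more robust):

```r
# Illustrative helper: split a text into sentences on ., ! or ?
# followed by whitespace (a simplistic rule).
split_sentences <- function(text) {
  unlist(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE))
}

# One entry per sentence, with a sentence-level id derived from the file id.
sentences <- lapply(documents$text, split_sentences)
sentence.texts <- unlist(sentences)
sentence.ids <- unlist(mapply(function(id, s) paste(id, seq_along(s), sep = "_"),
                              documents$id, sentences, SIMPLIFY = FALSE))

# Then import the sentences instead of the whole documents:
mallet.instances <- mallet.import(sentence.ids, sentence.texts,
    "C:/Users/Dell/desktop/stopwords.txt",
    token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")
```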

n.topics <- 15
topic.model  <- MalletLDA(n.topics, alpha.sum=3.33, beta=0.2)
topic.model$model$setRandomSeed(19820L)
topic.model$setOptimizeInterval(50L)
topic.model$loadDocuments(mallet.instances)
vocabulary <- topic.model$getVocabulary()
word.freqs <- mallet.word.freqs(topic.model)
topic.model$setAlphaOptimization(1,1000)
topic.model$train(1000)
topic.model$maximize(20)

doc.topics <- mallet.doc.topics(topic.model, smoothed = TRUE, normalized = TRUE)
topic.words <- mallet.topic.words(topic.model, smoothed = TRUE, normalized = TRUE)

topic.docs <- t(doc.topics)
topic.docs <- topic.docs / rowSums(topic.docs)


topics.labels <- rep("", n.topics)
for (topic in 1:n.topics) {
  topics.labels[topic] <- paste(
    mallet.top.words(topic.model, topic.words[topic, ], num.top.words = 5)$words,
    collapse = " ")
}

topics.labels

#Topics over time
# Note: java.parameters only takes effect if it is set before rJava
# (loaded by mallet) is first attached, so set it at the top of a fresh session.
options(java.parameters = "-Xmx2g")
library("dfrtopics")
library("dplyr")
library("ggplot2")
library("lubridate")
library("stringr")
library("mallet")

m <- mallet_model(doc_topics = doc.topics, doc_ids = documents$id,
                  vocab = vocabulary, topic_words = topic.words,
                  model = topic.model)

pd <- data.frame(date = list.files(path = "C:/Users/Dell/Desktop/MPS"))

# Strip the ".txt" extension (fixed = TRUE so the dot is matched literally)
pd$date <- gsub(".txt", "", pd$date, fixed = TRUE)

meta <- data.frame(id = documents$id, pubdate = as.Date(pd$date, "%Y%m%d"))
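Note that as.Date(pd$date, "%Y%m%d") returns NA when a filename encodes only year and month (e.g. 201801.txt), since no day is present. If that is the naming scheme (an assumption about the files, not stated in the question), pasting a day onto the string first is one workaround:

```r
# Assumption: pd$date holds strings like "201801" (YYYYMM, one file per month).
# as.Date needs a day component, so append "01" before parsing.
pd <- data.frame(date = "201801")  # illustrative stand-in for the real filenames
pubdate <- as.Date(paste0(pd$date, "01"), "%Y%m%d")
```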

metadata(m) <- meta

# Visualize topics over time
theme_update(strip.text = element_text(size = 7),
             axis.text = element_text(size = 7))
topic_series(m) %>%
    plot_series(labels = topic_labels(m, 2))
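If dfrtopics gives trouble, the same kind of plot can be built directly from doc.topics with dplyr and ggplot2. A sketch, assuming the rows of doc.topics line up with meta$pubdate (the column names V1, V2, ... are just the defaults as.data.frame assigns):

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Long format: one row per (date, topic), averaging proportions per date.
series <- as.data.frame(doc.topics) %>%
  mutate(pubdate = meta$pubdate) %>%
  pivot_longer(-pubdate, names_to = "topic", values_to = "proportion") %>%
  group_by(pubdate, topic) %>%
  summarise(proportion = mean(proportion), .groups = "drop")

ggplot(series, aes(pubdate, proportion)) +
  geom_line() +
  facet_wrap(~ topic) +
  theme(strip.text = element_text(size = 7), axis.text = element_text(size = 7))
```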

Upvotes: 1

Views: 712

Answers (1)

David Mimno

Reputation: 1901

130 documents isn't very much for estimating a topic model. Can the documents be subdivided into smaller segments?
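One way to act on this suggestion in code is to cut each file into fixed-length word chunks before mallet.import, so the model trains on many short documents instead of 130 long ones (the chunk_document helper and the 200-word size below are illustrative choices, not part of the answer):

```r
# Illustrative helper: cut a text into pieces of about chunk_size words each.
chunk_document <- function(text, chunk_size = 200) {
  words <- unlist(strsplit(text, "\\s+"))
  starts <- seq(1, length(words), by = chunk_size)
  vapply(starts, function(s) {
    paste(words[s:min(s + chunk_size - 1, length(words))], collapse = " ")
  }, character(1))
}

# Apply to the corpus and build chunk-level ids for mallet.import.
chunks <- lapply(documents$text, chunk_document)
chunk.texts <- unlist(chunks)
chunk.ids <- unlist(mapply(function(id, ch) paste(id, seq_along(ch), sep = "_"),
                           documents$id, chunks, SIMPLIFY = FALSE))
```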

Upvotes: 1
