Reputation:
I am running Latent Dirichlet Allocation (LDA) analyses for some research and keep running into a problem. Most LDA software requires documents to be in doclines format, meaning a CSV or other delimited file in which each line represents the entirety of a document. However, Blei's lda-c and dynamic topic model software require data in the format:
[M] [term_1]:[count] [term_2]:[count] ... [term_N]:[count]
where [M] is the number of unique terms in the document, and the [count] associated with each term is how many times that term appeared in the document. Note that [term_1] is an integer which indexes the term; it is not a string.
Does anyone know of a utility that will let me quickly convert to this format? Thank you.
Upvotes: 5
Views: 3308
Reputation: 799
For Python, the lda package has a function for this (it may not have been available when the question was asked): lda.utils.dtm2ldac, which converts a document-term matrix to the LDA-C format.
The documentation is at https://pythonhosted.org/lda/api.html#module-lda.utils
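If you would rather not depend on a package, the format is simple enough to produce by hand. Here is a minimal standard-library sketch that goes straight from doclines (one document per string) to LDA-C lines. Note that docs_to_ldac is a hypothetical helper name, not part of the lda package, and the tokenization (lowercasing and splitting on whitespace) is an assumption you will probably want to replace with your own.

```python
from collections import Counter


def docs_to_ldac(docs):
    """Convert an iterable of document strings to LDA-C lines.

    Hypothetical helper for illustration. Returns (lines, terms), where
    terms[i] is the string for integer term index i.
    """
    vocab = {}  # term string -> integer index, in order of first appearance
    lines = []
    for doc in docs:
        counts = Counter()
        # Naive tokenization (an assumption): lowercase, split on whitespace.
        for token in doc.lower().split():
            index = vocab.setdefault(token, len(vocab))
            counts[index] += 1
        # "[M] [term_1]:[count] ...", where M is the number of unique terms.
        pairs = " ".join(f"{i}:{c}" for i, c in sorted(counts.items()))
        lines.append(f"{len(counts)} {pairs}".rstrip())
    terms = sorted(vocab, key=vocab.get)
    return lines, terms


docs = [
    "I am the very model of a modern major general",
    "I have a major headache",
]
ldac_lines, terms = docs_to_ldac(docs)
for line in ldac_lines:
    print(line)
```

Write ldac_lines to one file and terms (one per line) to a vocabulary file, so the integer indices can be mapped back to words after fitting the model.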
Upvotes: 0
Reputation: 42313
If you are working with R, the lda package contains a function lexicalize that will convert raw text into the lda-c format necessary for the lda package.
library(lda)
example <- c("I am the very model of a modern major general",
             "I have a major headache")
corpus <- lexicalize(example, lower=TRUE)
Similarly, the topicmodels package has a function dtm2ldaformat that will convert a document term matrix to the lda format. You can convert a plain text document into a document term matrix using the tm package, also in R.
So with these existing functions there's a lot of flexibility in getting text into R for topic modelling.
Upvotes: 3
Reputation: 211
The Mallet package from the University of Massachusetts Amherst is another option.
And here is an excellent step-by-step demo on how to use Mallet:
You can use Mallet with plain text files as the input source.
Upvotes: 2