user836015

Convert one-document-per-line to Blei's lda-c/dtm format for topic modeling?

I am using Latent Dirichlet Allocation (LDA) for some research and keep running into a problem. Most LDA software expects documents in doclines format: a CSV or other delimited file in which each line holds the entire text of one document. However, Blei's lda-c and dynamic topic model (dtm) software require each document to be a line of the form [M] [term_1]:[count] [term_2]:[count] ... [term_N]:[count], where [M] is the number of unique terms in the document and each [count] is the number of times the associated term appears in that document. Note that each [term_i] is an integer index into the vocabulary, not a string.
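To make the target format concrete, here is a minimal standard-library Python sketch of the conversion. The function name `to_ldac_lines` is hypothetical, and the tokenization (lowercasing and splitting on whitespace) is an assumption for illustration only. Real corpora usually need stop-word removal and punctuation handling first.

```python
from collections import Counter


def to_ldac_lines(documents):
    """Convert raw text documents (one string each) to lda-c formatted lines.

    Returns (lines, vocabulary), where vocabulary[i] is the term with index i.
    Hypothetical helper: tokenization is a plain lowercase whitespace split.
    """
    vocab = {}  # term string -> integer index, assigned on first appearance
    lines = []
    for text in documents:
        counts = Counter()
        for token in text.lower().split():
            # Reuse the existing index, or assign the next free one
            index = vocab.setdefault(token, len(vocab))
            counts[index] += 1
        pairs = " ".join(f"{i}:{c}" for i, c in sorted(counts.items()))
        # [M] is the number of unique terms in this document
        lines.append(f"{len(counts)} {pairs}".rstrip())
    return lines, sorted(vocab, key=vocab.get)
```

For the two documents "a b a" and "b c", this returns the lines `2 0:2 1:1` and `2 1:1 2:1` with the vocabulary `a`, `b`, `c`. Writing the lines to one file and the vocabulary to another, one term per line, gives a data file and a matching term list.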

Does anyone know of a utility that will let me quickly convert to this format? Thank you.

Upvotes: 5

Views: 3308

Answers (4)

Lei Hao

Reputation: 799

For Python, there is a function for this (it may not have been available when the question was asked):

lda.utils.dtm2ldac

The documentation is at https://pythonhosted.org/lda/api.html#module-lda.utils

Upvotes: 0

Ben

Reputation: 42313

If you are working with R, the lda package contains a function lexicalize that will convert raw text into the lda-c format necessary for the lda package.

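library(lda)

# Two toy documents, one per element
example <- c("I am the very model of a modern major general",
             "I have a major headache")

# Tokenize, lowercase, and convert to the lda package's document format
corpus <- lexicalize(example, lower=TRUE)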

Similarly, the topicmodels package has a function dtm2ldaformat that will convert a document term matrix to the lda format. You can convert a plain text document into a document term matrix using the tm package, also in R.

So with these existing functions there's a lot of flexibility in getting text into R for topic modelling.

Upvotes: 3

Mountain

Reputation: 211

The Mallet package from University of Massachusetts Amherst is another option.

And here is an excellent step-by-step demo on how to use Mallet:

You can use Mallet with plain text files as the input source.

Upvotes: 2

Karsten

Reputation: 872

Gensim offers an implementation of Blei's corpus format. See here. You could build a quick corpus from your CSV file in Python and then save it in lda-c format with gensim. It should not be too hard.

Upvotes: 1
