Gregg Lind

Reputation: 21270

Latent Dirichlet Allocation, pitfalls, tips and programs

I'm experimenting with Latent Dirichlet Allocation for topic disambiguation and assignment, and I'm looking for advice.

  1. Which program is the "best", where best is some combination of easiest to use, best prior estimation, and fastest?
  2. How do I incorporate my intuitions about topicality? Let's say I think I know that some items in the corpus are really in the same category, like all articles by the same author. Can I add that into the analysis?
  3. Any unexpected pitfalls or tips I should know before embarking?

I'd prefer it if there were R or Python front ends for whatever program, but I expect (and accept) that I'll be dealing with C.

Upvotes: 21

Views: 9702

Answers (6)

import matplotlib.pyplot as plt


def plot_top_words(model, feature_names, n_top_words, title):
    # model is a fitted topic model exposing components_ (topics x words),
    # e.g. scikit-learn's LatentDirichletAllocation.
    # The 2 x 5 grid assumes the model has at most 10 topics.
    fig, axes = plt.subplots(2, 5, figsize=(30, 15), sharex=True)
    axes = axes.flatten()
    for topic_idx, topic in enumerate(model.components_):
        # Indices of the n_top_words largest weights, highest first.
        top_features_ind = topic.argsort()[:-n_top_words - 1:-1]
        top_features = [feature_names[i] for i in top_features_ind]
        weights = topic[top_features_ind]

        ax = axes[topic_idx]
        ax.barh(top_features, weights, height=0.7)
        ax.set_title(f'Topic {topic_idx + 1}',
                     fontdict={'fontsize': 30})
        ax.invert_yaxis()
        ax.tick_params(axis='both', which='major', labelsize=20)
        for i in 'top right left'.split():
            ax.spines[i].set_visible(False)

    # Set the figure title once, after all panels are drawn.
    fig.suptitle(title, fontsize=40)
    plt.subplots_adjust(top=0.90, bottom=0.05, wspace=0.90, hspace=0.3)
    plt.show()

Upvotes: 0

Ben

Reputation: 42293

  1. You mentioned a preference for R; you can use one of two packages, topicmodels (slow) or lda (fast). Python has deltaLDA, pyLDA, Gensim, etc.

  2. Topic modeling with specified topics or words is tricky out of the box. David Andrzejewski has some Python code that seems to do it. There is a C++ implementation of supervised LDA here. There are also plenty of papers on related approaches (DiscLDA, Labeled LDA), but not in an easy-to-use form, for me anyway.

  3. As @adi92 says, removing stopwords, white space, numbers and punctuation, and stemming, all improve things a lot. One possible pitfall is having the wrong (or an inappropriate) number of topics. Currently there are no straightforward diagnostics for how many topics are optimal for a corpus of a given size, etc. There are some measures of topic quality available in MALLET (fastest), which are very handy.
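The cleanup steps in point 3 can be sketched in plain Python. This is only a minimal illustration, not what any of the packages above do internally: the stop-word list is a tiny made-up sample, the `preprocess` name is hypothetical, and stemming is left out because it would normally come from a library.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def preprocess(text, stopwords=STOPWORDS):
    """Lowercase, drop numbers and punctuation, collapse whitespace, remove stop words."""
    text = text.lower()
    # Replace anything that is not an ASCII letter or whitespace with a space.
    # Note: this also strips accented letters, so adapt it for non-English text.
    text = re.sub(r"[^a-z\s]", " ", text)
    # split() with no arguments also collapses runs of whitespace.
    return [tok for tok in text.split() if tok not in stopwords]


tokens = preprocess("The 3 topics   are: sports, politics, and the arts!")
print(tokens)
```

The resulting token lists are what you would then hand to whichever LDA implementation you pick.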

Upvotes: 6

goh

Reputation: 29511

I second that. MALLET's LDA uses a SparseLDA data structure and distributed learning, so it's very fast. Switching on hyperparameter optimization will give a better result, IMO.

Upvotes: 0

Aditya Mukherji

Reputation: 9256

  1. http://mallet.cs.umass.edu/ is IMHO the most awesome plug-n-play LDA package out there. It uses Gibbs sampling to estimate topics and has a really straightforward command-line interface with a lot of extra bells-n-whistles (a few more complicated models, hyper-parameter optimization, etc.).

  2. It's best to let the algorithm do its job. There may be variants of LDA (and pLSI, etc.) which let you do some sort of semi-supervised thing. I don't know of any at the moment.

  3. I found that removing stop-words and other really high-frequency words improved the quality of my topics a lot (evaluated by looking at the top words of each topic, not by any rigorous metric). I am guessing stemming/lemmatization would help as well.
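One way to drop "really high-frequency words", as in point 3, is to filter by document frequency before training. The sketch below is just an illustration under assumptions: documents are already tokenized, the function name `drop_frequent_terms` is hypothetical, and the 0.5 cutoff is arbitrary and would need tuning per corpus.

```python
from collections import Counter


def drop_frequent_terms(docs, max_doc_fraction=0.5):
    """Remove tokens that appear in more than max_doc_fraction of the documents."""
    n_docs = len(docs)
    # Count each token once per document (document frequency, not raw count).
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    too_common = {tok for tok, df in doc_freq.items()
                  if df / n_docs > max_doc_fraction}
    filtered = [[tok for tok in doc if tok not in too_common] for doc in docs]
    return filtered, too_common


# Toy corpus (made up for illustration): "said" occurs in every document.
docs = [
    ["said", "team", "goal", "match"],
    ["said", "vote", "election", "party"],
    ["said", "team", "coach", "season"],
    ["said", "market", "shares", "bank"],
]
filtered, removed = drop_frequent_terms(docs, max_doc_fraction=0.5)
print(removed)
print(filtered[0])
```

Here "said" is removed because it appears in all four documents, while "team" (two of four, exactly at the cutoff) is kept.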

Upvotes: 17

eulerfx

Reputation: 37719

For this kind of analysis I have used LingPipe: http://alias-i.com/lingpipe/index.html. It is an open source Java library, parts of which I use directly or port. To incorporate your own data, you may use a classifier, such as naive Bayes, in conjunction. My experience with statistical NLP is limited, but it usually follows a cycle of setting up classifiers, training, looking over the results, and tweaking.

Upvotes: 1

Gregg Lind

Reputation: 21270

In addition to the usual sources, it seems like the most active place for discussion of this is the topics-models listserv. From my initial survey, the easiest package to understand is the LDA Matlab package.

This is not lightweight stuff at all, so I'm not surprised it's hard to find good resources on it.

Upvotes: 1
