Glorifier

Reputation: 31

Stemming and lemmatizing - What approach?

I am preparing to do topic modeling with Mallet and have finished pulling the raw datasets. Before I import and start modeling, I need to clean and streamline the texts, of course. I have my stopword lists ready, and I know that I can easily remove punctuation, digits, etc. with Excel. What I am a little fuzzy about is stemming and lemmatizing: not the concept itself, but what the best approach would be.

To give a better overview, here is what I would like to do:

Based on experience, can anyone recommend the best approach to these three, especially the last one? Is there an app that I can use for that?

Many thanks in advance!

Upvotes: 2

Views: 1217

Answers (1)

David Mimno

Reputation: 1901

See what happens without these interventions first.

Whitespace and punctuation usually aren't a problem, but you might want to make sure that the text of each document contains no tabs or newlines, as these can confuse the data import functions. A common problem arises when importing into something like Excel, which pays attention to quotation marks: it can interpret many lines as a single document if quotes aren't matched.
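As a rough sketch of the whitespace check described above, the following collapses tabs and newlines inside each document and emits one document per line. The `id<TAB>label<TAB>text` layout and the `none` placeholder label are assumptions about the import format, and `docs` and both function names are hypothetical, so adapt them to whatever your import step expects.

```python
import re

def flatten_document(text):
    """Collapse tabs, newlines, and runs of spaces into single spaces."""
    return re.sub(r"\s+", " ", text).strip()

def to_one_per_line(docs):
    """Build one 'id<TAB>label<TAB>text' line per document.

    `docs` is a hypothetical dict mapping a whitespace-free document id
    to its raw text; the label column gets a placeholder value.
    """
    return [
        f"{doc_id}\tnone\t{flatten_document(text)}"
        for doc_id, text in docs.items()
    ]

if __name__ == "__main__":
    sample = {"doc1": "First line\nsecond line\twith a tab"}
    for line in to_one_per_line(sample):
        print(line)
```

Writing the file yourself this way also sidesteps the unmatched-quote issue, since nothing in the pipeline interprets quotation marks.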

The problem with stemming, lemmatization, and spelling regularization is that they have the same objective as the topic model itself. The model's goal is to combine semantically similar words based on context, so it doesn't actually have a problem with the kind of variation you see in English. For languages with much richer morphology you might need something more advanced. But in most cases you are really just making the model's job harder.

One way to use a stemmer is to stem after modeling. People often think they need stemmers because they see multiple small variants of a word in the model output. I would argue that this is a sign the model is working, but I can see that it might not be the best interface. In that case you could detect which words in a topic map to the same stem, and show only the original form of the most frequent one.
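A minimal sketch of that post-modeling step, assuming you already have a topic's top words and a word-frequency table. `crude_stem` is a toy suffix stripper used only as a stand-in so the example is self-contained; in practice you would pass a real stemmer's function as the `stem` argument. All names here are hypothetical.

```python
def crude_stem(word):
    """Toy suffix stripper, a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def collapse_topic_words(topic_words, counts, stem=crude_stem):
    """Keep one display form per stem: the most frequent original word.

    topic_words: words of one topic, in the model's ranked order.
    counts: mapping from word to its frequency in the corpus.
    """
    best = {}   # stem -> most frequent original form seen so far
    order = []  # stems in order of first appearance in the topic
    for word in topic_words:
        key = stem(word)
        if key not in best:
            best[key] = word
            order.append(key)
        elif counts.get(word, 0) > counts.get(best[key], 0):
            best[key] = word
    return [best[key] for key in order]

if __name__ == "__main__":
    words = ["models", "model", "modeling", "topic", "topics"]
    freq = {"models": 40, "model": 90, "modeling": 25, "topic": 70, "topics": 30}
    print(collapse_topic_words(words, freq))
```

The model itself is still trained on the unstemmed text; only the displayed word list is shortened.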

In my experience the most effective interventions you can make are cleaning up problems in the input (like words split by line-end hyphenation, e.g. "hyphen- ated") and converting significant multi-word terms to single tokens (like "topic modeling" to "topic_modeling").
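Both clean-up steps can be sketched with regular expressions. This is only an illustration: the function names and the phrase list are hypothetical, and the hyphen rule is deliberately naive (it would also join a suspended hyphen such as "pre- and"), so check its output on your own data.

```python
import re

def join_broken_hyphens(text):
    """Rejoin words split by a hyphen followed by whitespace ('hyphen- ated')."""
    return re.sub(r"(\w)-\s+(\w)", r"\1\2", text)

def merge_phrases(text, phrases):
    """Replace each listed multi-word term with an underscore-joined token."""
    # Longest phrases first, so a longer term is merged before any
    # shorter term contained in it.
    for phrase in sorted(phrases, key=len, reverse=True):
        parts = phrase.split()
        pattern = r"\b" + r"\s+".join(re.escape(p) for p in parts) + r"\b"
        token = "_".join(parts)
        text = re.sub(pattern, lambda m: token, text, flags=re.IGNORECASE)
    return text

if __name__ == "__main__":
    raw = "We apply topic modeling to hyphen- ated input."
    cleaned = merge_phrases(join_broken_hyphens(raw), ["topic modeling"])
    print(cleaned)
```

Which phrases count as significant is a judgment call for your corpus; the list has to come from you or from a separate phrase-detection step.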

Upvotes: 2
