triandicAnt

Reputation: 1378

Latent Dirichlet allocation (LDA) performance when limiting word length in corpus documents

I have been generating topics from a Yelp dataset of customer reviews using Latent Dirichlet allocation (LDA) in Python (gensim package). While generating tokens, I select only words of length >= 3 from the reviews (using RegexpTokenizer):

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w{3,}')
tokens = tokenizer.tokenize(review)

This filters out noisy words shorter than three characters while building the corpus documents.
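For reference, the filtered tokens then feed into gensim's LDA roughly like this (a minimal sketch; reviews is assumed to be a list of review strings, and num_topics is an arbitrary choice):

from gensim import corpora, models

# tokenize each review, keeping only words of length >= 3
texts = [tokenizer.tokenize(review.lower()) for review in reviews]

# map tokens to integer ids and build the bag-of-words corpus
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# train LDA on the filtered corpus
lda = models.LdaModel(corpus, num_topics=10, id2word=dictionary)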

How will filtering out these words affect the performance of the LDA algorithm?

Upvotes: 0

Views: 895

Answers (3)

dorriz

Reputation: 2689

You could use the nltk library for this task, I think; something like:

import string
import nltk

nltk.download("stopwords")  # one-time download of the stopword lists

def remove_unwanted(word):
    # plausible body for the helper referenced below: strip punctuation
    return word.translate(str.maketrans("", "", string.punctuation))

def remove_words(tokens):
    stopwords = nltk.corpus.stopwords.words(
        "english"
    )  # also supports German, Spanish, Portuguese, and others!
    stopwords = {
        remove_unwanted(word) for word in stopwords
    }  # remove punctuation from stopwords; a set makes lookups fast
    cleaned_tokens = [token for token in tokens if token not in stopwords]
    return cleaned_tokens
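A quick usage sketch, chained after the tokenizer from the question:

tokens = tokenizer.tokenize("This place was great and the staff were friendly")
print(remove_words(tokens))
# expected: ['This', 'place', 'great', 'staff', 'friendly']; lowercase the
# tokens first if capitalized stop words such as 'This' should go too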

Upvotes: 0

Sara

Reputation: 1212

Words of fewer than three letters are mostly stop words. LDA builds topics, so imagine you generate this topic:

[I, him, her, they, we, and, or, to]

compared to:

[shark, bull, greatwhite, hammerhead, whaleshark]

Which is more telling? This is why it is important to remove stopwords. This is how I do that:

import gensim
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')  # needed once for the WordNet lemmatizer

# Create functions to lemmatize, stem, and preprocess

# turn beautiful, beautifully, beautified into the stem beauti
def lemmatize_stemming(text):
    stemmer = PorterStemmer()
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# parse docs into individual words, ignoring words less than 4 letters long
# and stopwords: him, her, them, for, there, etc., since "their" is not a topic.
# then append the tokens to a list
def preprocess(text):
    newStopWords = ['your_stopword1', 'your_stopword2']  # add your own here
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and token not in newStopWords and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result
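A quick check on a made-up review (the exact stems come from the Porter stemmer):

print(preprocess("The sharks were beautiful and terrifying"))
# expected: ['shark', 'beauti', 'terrifi']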

Upvotes: 0

Brian O'Donnell

Reputation: 1886

Generally speaking, for the English language, one- and two-letter words don't add information about the topic. If they don't add value, they should be removed during the pre-processing step. As with most algorithms, feeding in less data speeds up execution.
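A minimal sketch of that effect, assuming reviews is a list of review strings: compare token and vocabulary counts with and without the length filter, since LDA's cost grows with both.

from nltk.tokenize import RegexpTokenizer

all_words = RegexpTokenizer(r'\w+').tokenize(" ".join(reviews))
long_words = [w for w in all_words if len(w) >= 3]

# fewer tokens and fewer distinct terms mean a smaller dictionary and
# document-term matrix, and therefore faster LDA training
print(len(all_words), "tokens /", len(set(all_words)), "terms unfiltered")
print(len(long_words), "tokens /", len(set(long_words)), "terms with len >= 3")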

Upvotes: 0
