Khachatur Mirijanyan

Reputation: 435

Topic Modeling Memory Error: How to do gensim topic modelling with large amounts of data

I'm having an issue topic modeling with a lot of data. I am trying to do both LDA and NMF topic modeling, which I have done before, but not with the volume of data I am currently working with. The main issue is that I can't hold all my data in memory while also creating the models.

I need both the models and the associated metrics. Here is how I currently build my models:

from gensim.models import LdaMulticore
from gensim.models.nmf import Nmf

def make_lda(dictionary, corpus, num_topics):
    passes = 3

    # Make an index-to-word dictionary.
    temp = dictionary[0]  # This is only to "load" the dictionary.
    id2word = dictionary.id2token

    model = LdaMulticore(
        corpus=corpus,
        id2word=id2word,
        passes=passes,
        num_topics=num_topics
    )
    
    return model

def make_nmf(dictionary, corpus, num_topics):
    
    passes = 3

    # Make an index-to-word dictionary.
    temp = dictionary[0]  # This is only to "load" the dictionary.
    id2word = dictionary.id2token
    
    model = Nmf(
        corpus=corpus,
        id2word=id2word,
        passes=passes,
        num_topics=num_topics
    )
    
    return model

And here is how I get the coherence measures and some other statistics:

import numpy as np

def get_model_stats(model, model_type, docs, dictionary, corpus, num_topics, verbose=False, get_topics=False):
    if model_type == 'lda':
        top_topics = model.top_topics(texts=docs, dictionary=dictionary, coherence='c_v') #, num_words=20)
    elif model_type == 'nmf':
        top_topics = model.top_topics(corpus=corpus, texts=docs, dictionary=dictionary, coherence='c_v') #, num_words=20)

    # Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
    avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
    rstd_atc = np.std([t[1] for t in top_topics]) / avg_topic_coherence
  
    if verbose:
        print('Average topic coherence: ', avg_topic_coherence)
        print('Relative Standard Deviation of ATC: ', rstd_atc)
    
    if get_topics:
        return avg_topic_coherence, rstd_atc, top_topics
    
    return avg_topic_coherence, rstd_atc

As you can see, I need my dictionary, texts, corpus, and id2token objects in memory at different times, sometimes all at the same time. But I can't do that, since the texts alone use up a ton of memory. My machine just does not have enough.

I know I can pay for a virtual machine with a huge amount of RAM, but I want to know if there is a better solution. I can store all of my data on disk. Is there a way to run these models where the data is not in memory? Is there some other solution where I don't overload my memory?

Upvotes: 0

Views: 828

Answers (2)

gojomo

Reputation: 54143

You don't show how your corpus (or docs/texts) is created, but the single most important thing to remember with Gensim is that entire training sets essentially never have to be in-memory at once (as with a giant list).

Rather, you can (and, for any large corpus where memory could be an issue, should) provide it as a re-iterable Python sequence that only reads individual items from underlying storage as requested. Using a Python generator is usually a key part (but not the whole story) of such an approach.

The original creator of the Gensim package has a blog post going over the basics: "Data streaming in Python: generators, iterators, iterables"
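For illustration, here is a minimal sketch of such a re-iterable, streamed corpus. It assumes one tokenized document per line of a plain-text file; the file name, the whitespace tokenization, and the call to your make_lda are placeholders to adapt to your own pipeline:

from gensim.corpora import Dictionary

class StreamedCorpus:
    """Re-iterable corpus: re-reads one document per line from disk on every
    pass, so the full training set never has to sit in memory at once."""
    def __init__(self, path, dictionary):
        self.path = path
        self.dictionary = dictionary

    def __iter__(self):
        with open(self.path, encoding='utf-8') as fh:
            for line in fh:
                tokens = line.split()  # placeholder: use your real tokenization
                yield self.dictionary.doc2bow(tokens)

# The dictionary itself can be built in one streamed pass over the same file:
# dictionary = Dictionary(line.split() for line in open('docs.txt', encoding='utf-8'))
# corpus = StreamedCorpus('docs.txt', dictionary)
# model = make_lda(dictionary, corpus, num_topics=20)

Because __iter__ reopens the file each time, the same object can be iterated over multiple training passes, which is what LdaMulticore and Nmf need.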

Upvotes: 1

sophros

Reputation: 16620

There are some small tweaks you can make that will probably not change much on their own (e.g. turning list comprehensions into generator expressions, such as when summing up), but they are general memory-saving hints, so I think they are worth mentioning.
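For instance, in your get_model_stats the sum over topic coherences can take a generator expression instead of first materializing a list:

# Generator expression instead of a list comprehension inside sum():
avg_topic_coherence = sum(t[1] for t in top_topics) / num_topics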

One change that can make a noticeable difference is more aggressive pruning of the Dictionary. The default is prune_at=2000000; you may want to lower it to a smaller value if you have plenty of documents.
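For example (the lowered value and the tokenized_docs name are only illustrative):

from gensim.corpora import Dictionary

# Cap how many tokens the dictionary keeps in RAM while scanning documents.
dictionary = Dictionary(tokenized_docs, prune_at=100_000)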

Another thing to do is to apply the filter_extremes function to the created dictionary to remove words that are unlikely to influence the results. Here, again, you can set the parameters more aggressively (a short sketch follows the parameter list):

no_below – Keep tokens which are contained in at least no_below documents.

no_above – Keep tokens which are contained in no more than no_above documents (fraction of total corpus size, not an absolute number).

keep_n – Keep only the first keep_n most frequent tokens.
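A minimal sketch, assuming the dictionary built above; the exact numbers are only an illustration of "more aggressive" settings, not recommendations (the gensim defaults are no_below=5, no_above=0.5, keep_n=100000):

# Drop rare tokens, overly common tokens, and everything beyond the top 50,000.
dictionary.filter_extremes(no_below=20, no_above=0.3, keep_n=50_000)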

On top of that, you may want to call the garbage collector every once in a while (e.g. before running the make_nmf function):

import gc
gc.collect()

And definitely do not run make_nmf and make_lda in parallel (you are probably not doing that, but I wanted to highlight it because we do not see your whole code).

Tweaking these values can help you reduce the memory footprint while maintaining the best possible model.

Upvotes: 0
