user10046100

Reputation: 85

Handling Memory Error when dealing with a really large number of words (>100 million) for LDA analysis

I have 50,000k files that have a combined total of 162 million words. I want to do topic modelling using Gensim, similar to this tutorial here.

So, LDA requires one to tokenize the documents into words and then create a word frequency dictionary.

I have these files read into a pandas dataframe (the 'content' column has the text) and do the following to create a list of the texts (image of the dataframe attached here):

texts = [[word for word in row[1]['content'].lower().split() if word not in stopwords] for row in df.iterrows()]

However, I have been running into a memory error because of the large word count.

I also tried the TfidfVectorizer from scikit-learn, as below. I got a memory error for this too.

import re
from sklearn.feature_extraction.text import TfidfVectorizer

def simple_tokenizer(str_input):
    # replace everything except letters, digits and hyphens, then lowercase and split
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()
    return words

vectorizer = TfidfVectorizer(use_idf=True, tokenizer=simple_tokenizer, stop_words='english')
X = vectorizer.fit_transform(df['content'])

How do I handle tokenizing these really long documents in a way that can be processed for LDA analysis?

I have an i7 desktop with 16 GB of RAM, if that matters.

EDIT

Since Python was unable to store such a large list in memory, I rewrote the code to read each file (originally stored as HTML), convert it to text, create a text vector, append it to a list, and then send the list to the LDA code. It worked!
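A rough sketch of that pipeline, for anyone hitting the same wall (BeautifulSoup for the HTML-to-text step, the directory name, and the LDA parameters are illustrative assumptions on my part; stopwords is the same stop word list as above):

import os
from bs4 import BeautifulSoup
from gensim import corpora, models

html_dir = 'html_files/'   # assumed location of the HTML files
texts = []

for filename in os.listdir(html_dir):
    with open(os.path.join(html_dir, filename), 'r') as f:
        # convert the HTML to plain text, then to a token vector
        text = BeautifulSoup(f.read(), 'html.parser').get_text()
        texts.append([w for w in text.lower().split() if w not in stopwords])

# build the word frequency dictionary and bag-of-words corpus, then run LDA
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = models.LdaModel(corpus, num_topics=10, id2word=dictionary)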

Upvotes: 2

Views: 616

Answers (1)

arudzinska

Reputation: 3331

"So, LDA requires one to tokenize the documents into words and then create a word frequency dictionary."

If the only output you need from this is a dictionary with the word count, I would do the following:

Process files one by one in a loop. This way you store only one file in memory. Process it, then move to the next one:

# for all files in your directory/directories:
with open(current_file, 'r') as f:
    for line in f:
        # your logic to update the dictionary with the word count

# here the file is closed and the loop moves to the next one
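A minimal, concrete version of that loop might look like the following (the directory name is an assumption, and Counter is just a dict subclass, so the memory caveat in the edit below applies to it as well):

import os
from collections import Counter

data_dir = 'text_files/'   # assumed location of the files
word_counts = Counter()

for filename in os.listdir(data_dir):
    with open(os.path.join(data_dir, filename), 'r') as f:
        for line in f:
            # update the running word count line by line
            word_counts.update(line.lower().split())
    # the file is closed here before the loop moves on to the next one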

EDIT: When it comes to issues with keeping a really large dictionary in memory, you have to remember that Python reserves a lot of memory to keep the dict's density low - the price for fast lookups. Therefore, you have to look for another way of storing the key-value pairs, e.g. a list of tuples, at the cost of much slower lookups. This question is about that and has some nice alternatives described there.
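Purely as an illustration of that trade-off, a sorted list of (word, count) tuples with bisect for the (slower) lookup could look roughly like this; the helper name is made up:

import bisect

pairs = []   # sorted list of (word, count) tuples instead of a dict

def add_word(word):
    # binary search is O(log n); inserting into the list is O(n)
    i = bisect.bisect_left(pairs, (word, 0))
    if i < len(pairs) and pairs[i][0] == word:
        pairs[i] = (word, pairs[i][1] + 1)
    else:
        pairs.insert(i, (word, 1))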

Upvotes: 1
