Reputation: 85
I have 50,000 files with a combined total of 162 million words. I wanted to do topic modelling with Gensim, similar to this tutorial here
So, LDA requires one to tokenize the documents into words and then create a word frequency dictionary.
I have these files read into a pandas DataFrame (the 'content' column holds the text) and do the following to create a list of tokenized texts:
texts = [[word for word in row['content'].lower().split() if word not in stopwords] for _, row in df.iterrows()]
However, I have been running into a memory error because of the large word count.
I also tried scikit-learn's TfidfVectorizer, but got a memory error with that too.
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def simple_tokenizer(str_input):
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()
    return words

vectorizer = TfidfVectorizer(use_idf=True, tokenizer=simple_tokenizer, stop_words='english')
X = vectorizer.fit_transform(df['content'])
How do I tokenize these really long documents in a way that can be processed for LDA analysis?
I have an i7, 16GB Desktop if that matters.
EDIT
Since Python was unable to store such large lists in memory, I rewrote the code to read each file (originally stored as HTML) one at a time, convert it to text, create a token vector, append it to a list, and then pass that list to the LDA code. It worked!
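A streaming version of this approach can be sketched as follows. The stopword set, file contents, and temporary files below are purely illustrative stand-ins; only one file's tokens are held in memory at a time:

```python
import os
import re
import tempfile

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in"}  # illustrative subset

def tokenize(text):
    # Same regex-based cleanup as the TfidfVectorizer tokenizer above.
    return [w for w in re.sub(r"[^A-Za-z0-9\-]", " ", text).lower().split()
            if w not in STOPWORDS]

def iter_docs(paths):
    """Yield one tokenized document at a time; only one file is in memory."""
    for path in paths:
        with open(path, "r", encoding="utf-8") as f:
            yield tokenize(f.read())

# Demo with two small temporary files standing in for the HTML-derived text.
tmpdir = tempfile.mkdtemp()
paths = []
for i, text in enumerate(["The cat sat on the mat", "A dog and a cat"]):
    p = os.path.join(tmpdir, f"doc{i}.txt")
    with open(p, "w", encoding="utf-8") as f:
        f.write(text)
    paths.append(p)

docs = list(iter_docs(paths))
```

Note that gensim's `corpora.Dictionary` accepts any iterable of token lists, so `corpora.Dictionary(iter_docs(paths))` builds the word-frequency dictionary without ever materializing all documents at once.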
Upvotes: 2
Views: 616
Reputation: 3331
So, LDA requires one to tokenize the documents into words and then create a word frequency dictionary.
If the only output you need from this is a dictionary with the word count, I would do the following:
Process the files one by one in a loop. This way you keep only one file in memory at a time: process it, then move on to the next one:
for current_file in files:  # all files in your directory/directories
    with open(current_file, 'r') as f:
        for line in f:
            ...  # your logic to update the dictionary with the word count
    # here the file is closed and the loop moves to the next one
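One way to fill in that placeholder is `collections.Counter`, which updates counts in place. In this sketch a small list of strings stands in for the lines of a file:

```python
import re
from collections import Counter

counts = Counter()

# Stand-in for iterating over the lines of each file.
lines = ["The quick brown fox", "jumps over the lazy dog"]
for line in lines:
    # Extract lowercase word tokens and add them to the running tally.
    counts.update(re.findall(r"[a-z0-9\-]+", line.lower()))
```

`Counter` is itself a dict subclass, so this keeps the fast lookups while letting you process arbitrarily many files with only one line in memory at a time.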
EDIT: When it comes to keeping a really large dictionary in memory, remember that Python reserves a lot of extra memory to keep the dict's density low - the price for fast lookups. You therefore have to look for another way of storing the key-value pairs, e.g. a list of tuples, but the cost will be much slower lookup. This question is about that and has some nice alternatives described there.
Upvotes: 1