Reputation: 85
I have 50,000 files with a combined total of 162 million words. I wanted to do topic modelling with Gensim, similar to this tutorial here
So, LDA requires one to tokenize the documents into words and then create a word frequency dictionary.
I have these files read into a pandas DataFrame (the 'content' column holds the text) and do the following to create a list of tokenized texts:
texts = [[word for word in row['content'].lower().split() if word not in stopwords] for _, row in df.iterrows()]
However, I have been running into a memory error because of the large word count.
I also tried scikit-learn's TfidfVectorizer, but got a memory error with that too.
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def simple_tokenizer(str_input):
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()
    return words

vectorizer = TfidfVectorizer(use_idf=True, tokenizer=simple_tokenizer, stop_words='english')
X = vectorizer.fit_transform(df['content'])
How do I tokenize these really long documents in a way that can be processed for LDA analysis?
I have an i7, 16GB Desktop if that matters.
EDIT
Since Python was unable to store such large lists in memory, I rewrote the code to read each file (originally stored as HTML) one at a time, convert it to text, create a token vector, append it to a list, and then pass that list to the LDA code. It worked!
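A streaming version of this approach can be sketched as follows. The stopword set, file contents, and temporary files below are purely illustrative stand-ins; only one file's tokens are held in memory at a time:

```python
import os
import re
import tempfile

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in"}  # illustrative subset

def tokenize(text):
    # Same regex-based cleanup as the TfidfVectorizer tokenizer above.
    return [w for w in re.sub(r"[^A-Za-z0-9\-]", " ", text).lower().split()
            if w not in STOPWORDS]

def iter_docs(paths):
    """Yield one tokenized document at a time; only one file is in memory."""
    for path in paths:
        with open(path, "r", encoding="utf-8") as f:
            yield tokenize(f.read())

# Demo with two small temporary files standing in for the HTML-derived text.
tmpdir = tempfile.mkdtemp()
paths = []
for i, text in enumerate(["The cat sat on the mat", "A dog and a cat"]):
    p = os.path.join(tmpdir, f"doc{i}.txt")
    with open(p, "w", encoding="utf-8") as f:
        f.write(text)
    paths.append(p)

docs = list(iter_docs(paths))
```

Note that gensim's `corpora.Dictionary` accepts any iterable of token lists, so `corpora.Dictionary(iter_docs(paths))` builds the word-frequency dictionary without ever materializing all documents at once.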
Upvotes: 2
Views: 616
Reputation: 3331
So, LDA requires one to tokenize the documents into words and then create a word frequency dictionary.
If the only output you need from this is a dictionary with the word count, I would do the following:
Process the files one by one in a loop. This way you keep only one file in memory at a time: process it, then move on to the next one:
for current_file in files:  # all files in your directory/directories
    with open(current_file, 'r') as f:
        for line in f:
            ...  # your logic to update the dictionary with the word count
    # here the file is closed and the loop moves to the next one
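One way to fill in that placeholder is `collections.Counter`, which updates counts in place. In this sketch a small list of strings stands in for the lines of a file:

```python
import re
from collections import Counter

counts = Counter()

# Stand-in for iterating over the lines of each file.
lines = ["The quick brown fox", "jumps over the lazy dog"]
for line in lines:
    # Extract lowercase word tokens and add them to the running tally.
    counts.update(re.findall(r"[a-z0-9\-]+", line.lower()))
```

`Counter` is itself a dict subclass, so this keeps the fast lookups while letting you process arbitrarily many files with only one line in memory at a time.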
EDIT: When it comes to keeping a really large dictionary in memory, remember that Python reserves a lot of extra memory to keep the dict's density low - the price for fast lookups. You therefore have to look for another way of storing the key-value pairs, e.g. a list of tuples, but the cost will be much slower lookup. This question is about that and has some nice alternatives described there.
Upvotes: 1