Reputation: 59
I have a huge number of sentences (just over 100,000), each containing on average 10 words. I am trying to put them together into one big list so I can use Counter from the collections library to show me how frequently each word occurs. What I'm doing currently is this:
from collections import Counter

words = []
for sentence in sentenceList:
    words = words + sentence.split()

counts = Counter(words)
I was wondering if there is a way to do this more efficiently. I've been waiting almost an hour for the code to finish executing. I suspect the concatenation is what makes it so slow, since if I replace the line words = words + sentence.split() with print(sentence.split()), it finishes in seconds. Any help would be much appreciated.
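For what it's worth, here is a rough timing sketch on synthetic sentences (not my real data) that seems to show the concatenation loop slowing down disproportionately as the list grows:

import timeit

# Rough sketch: time only the concatenation loop on growing subsets of
# synthetic sentences to see how the cost scales.
sentences = ["the quick brown fox jumps over the lazy dog today"] * 8000

def build(n):
    words = []
    for sentence in sentences[:n]:
        words = words + sentence.split()  # allocates a brand-new list each time
    return words

for n in (2000, 4000, 8000):
    print(n, timeit.timeit(lambda: build(n), number=1))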
Upvotes: 1
Views: 96
Reputation: 16623
You can use extend:
from collections import Counter

words = []
for sentence in sentenceList:
    words.extend(sentence.split())

counts = Counter(words)
Or use a list comprehension, like so:
words = [word for sentence in sentenceList for word in sentence.split()]
If you don't need words later, you can pass a generator into Counter:
counts = Counter(word for sentence in sentenceList for word in sentence.split())
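Either way, the resulting Counter gives you the word frequencies directly. A small usage sketch (assuming counts has been built with one of the approaches above):

print(counts.most_common(10))  # the 10 most frequent words with their counts
print(counts["the"])           # count for a single word (0 if it never appears)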
Upvotes: 2
Reputation: 106891
Don't build a big, memory-hogging list if all you want to do is count the elements. Keep updating the Counter object with new iterables instead:
from collections import Counter

counts = Counter()
for sentence in sentenceList:
    counts.update(sentence.split())
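Note that Counter.update adds to the existing counts rather than replacing them (unlike dict.update), which is what makes this incremental approach work. A tiny illustration:

from collections import Counter

counts = Counter()
counts.update("a b a".split())  # 'a' -> 2, 'b' -> 1
counts.update("b c".split())    # 'a' -> 2, 'b' -> 2, 'c' -> 1
print(counts)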
Upvotes: 3