Reputation: 181
I am creating a bag of words from a text corpus and am trying to limit the size of my vocabulary because the program freezes when I try to convert my list to a pandas DataFrame. I am using Counter to count the number of occurrences of each word:
from collections import Counter
import pandas as pd

bow = []
# corpus is a list of text samples; each sample is a variable-length list of words
# (punctuation and stopwords are defined elsewhere)
for tokenized_text in corpus:
    clean_text = [tok.lower() for tok in tokenized_text
                  if tok not in punctuation and tok not in stopwords]
    bow.append(Counter(clean_text))

# Program freezes here
df_bows = pd.DataFrame.from_dict(bow)
My input is a list of length num_samples, where each text sample is a list of tokens. For my output I want a pandas DataFrame with shape (num_samples, 10000), where 10000 is the size of my vocabulary. Previously, my vocabulary size (df_bows.shape[1]) would get very large (greater than 50,000). How can I choose the 10,000 most frequently occurring words from my bow list of Counter objects and place them in a DataFrame while preserving the number of text samples?
Upvotes: 3
Views: 821
Reputation: 76297
To find the overall top 10000 words, the easiest way would be to update a global Counter:
from collections import Counter

global_counter = Counter()  # <- create a counter
for tokenized_text in corpus:
    clean_text = [tok.lower() for tok in tokenized_text
                  if tok not in punctuation and tok not in stopwords]
    global_counter.update(clean_text)  # <- update it
At this point, you could just use:
import pandas as pd
df = pd.DataFrame(global_counter.most_common(10000))
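The resulting frame has two unnamed columns (the word and its count); if you want labeled columns, pandas' columns parameter does it (the names word and count here are just illustrative):

df = pd.DataFrame(global_counter.most_common(10000),
                  columns=['word', 'count'])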
If you would like the counts of those words for the individual text samples, add the following code (after the previous one):
most_common = set(t[0] for t in global_counter.most_common(10000))
occurrences = []
for tokenized_text in corpus:
    clean_text = Counter(tok.lower() for tok in tokenized_text
                         if tok not in punctuation and tok not in stopwords)
    occurrences.append({c: clean_text.get(c, 0) for c in most_common})
Now just use:
pd.DataFrame(occurrences)
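Since every dict in occurrences has the same 10,000 keys, the resulting DataFrame has shape (num_samples, 10000), which matches the output the question asks for. As a quick sanity check (a sketch reusing the names above):

df_bows = pd.DataFrame(occurrences)
assert df_bows.shape == (len(corpus), len(most_common))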
Upvotes: 3
Reputation: 442
You can get the most frequently occurring words by using Counter's most_common helper method:
from collections import Counter
clean_text = [tok.lower() for tok in tokenized_text if tok not in punctuation and tok not in stopwords]
counter = Counter(clean_text)
counter.most_common(10000)
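Note that this counts the words of a single tokenized_text. To pick one global vocabulary from the question's bow list of per-sample Counters, one option (a sketch reusing the asker's bow variable) is to merge the counters first and then take the top 10,000:

from collections import Counter

total = Counter()
for sample_counts in bow:        # bow is the list of per-sample Counters
    total.update(sample_counts)  # update() adds counts together
vocab = [word for word, _ in total.most_common(10000)]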
Upvotes: 0
Reputation: 2811
Counter.most_common(n) returns the n most common elements. See the documentation: https://docs.python.org/3/library/collections.html#collections.Counter.most_common
from collections import Counter
myStr = "It was a very, very good presentation, was it not?"
C = Counter(myStr.split())
C.most_common(2)
# [('was', 2), ('It', 1)]
Upvotes: 0