CrashingWater

Reputation: 181

Keep most frequently occurring values in python list

I am creating a bag of words from a text corpus and am trying to limit the size of my vocabulary because the program freezes when I try to convert my list to a pandas DataFrame. I am using Counter to count the number of occurrences of each word:

from collections import Counter
import pandas as pd

bow = []
# corpus is a list of text samples, where each text sample is a list of words of variable length
for tokenized_text in corpus:
    clean_text = [tok.lower() for tok in tokenized_text if tok not in punctuation and tok not in stopwords]
    bow.append(Counter(clean_text))
# Program freezes here
df_bows = pd.DataFrame.from_dict(bow)

My input is a list of length num_samples, where each element is a text sample given as a list of tokens. For my output I want a pandas DataFrame with shape (num_samples, 10000), where 10000 is the size of my vocabulary. Previously, my df_bows vocabulary size (df_bows.shape[1]) would get very large (greater than 50,000). How can I choose the 10,000 most frequently occurring words from my bow list of Counter objects and place them in a DataFrame while preserving the number of text samples?

Upvotes: 3

Views: 821

Answers (3)

Ami Tavory

Reputation: 76297

To find the overall top 10000 words, the easiest way is to update a global Counter:

from collections import Counter
global_counter = Counter() # <- create a counter
for tokenized_text in corpus:
    clean_text = [tok.lower() for tok in tokenized_text if tok not in punctuation and tok not in stopwords]
    global_counter.update(clean_text) # <- update it

At this point, you could just use

import pandas as pd
df = pd.DataFrame(global_counter.most_common(10000))
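
Note that this gives a two-column DataFrame of (word, count) pairs, one row per word, not one row per text sample. A toy illustration (the corpus string below is made up for demonstration):

from collections import Counter
import pandas as pd

# purely illustrative toy input
global_counter = Counter("the cat sat on the mat the end".split())
df = pd.DataFrame(global_counter.most_common(3), columns=["word", "count"])
print(df)
# one row per word: the/3, cat/1, sat/1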

If you also want the counts of those words for each specific text sample, add the following code (after the block above).

most_common = set(t[0] for t in global_counter.most_common(10000))
occurrences = []
for tokenized_text in corpus:
    clean_text = Counter(
        tok.lower()
        for tok in tokenized_text
        if tok not in punctuation and tok not in stopwords
    )
    occurrences.append({c: clean_text.get(c, 0) for c in most_common})

Now just use

pd.DataFrame(occurrences)

which gives you one row per text sample and one column per vocabulary word.
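
If you already have the bow list of Counter objects from the question, an equivalent route (a sketch under that assumption, not from the original answer) is to sum the per-sample counters into the global one and then build the per-sample matrix directly:

import pandas as pd
from collections import Counter

# bow is the list of per-sample Counters built in the question
global_counter = Counter()
for sample_counter in bow:
    global_counter.update(sample_counter)  # adds each sample's counts

# column order follows overall frequency, most common first
top_words = [w for w, _ in global_counter.most_common(10000)]
df_bows = pd.DataFrame(
    [{w: c.get(w, 0) for w in top_words} for c in bow],
    columns=top_words,
)
# df_bows.shape == (num_samples, 10000)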

Upvotes: 3

Malik Faiq

Reputation: 442

You can get the most frequently occurring words using the Counter.most_common helper method:

from collections import Counter
clean_text = [tok.lower() for tok in tokenized_text if tok not in punctuation and tok not in stopwords]
counter = Counter(clean_text)
counter.most_common(10000)

Upvotes: 0

BcK

Reputation: 2811

Counter.most_common(n) returns the n most common elements.

See the docs: https://docs.python.org/3/library/collections.html#collections.Counter.most_common

from collections import Counter

myStr = "It was a very, very good presentation, was it not?"
C = Counter(myStr.split())
C.most_common(2)

# [('was', 2), ('It', 1)]
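
If you then want to trim a Counter down to only its top-n keys, which is essentially what the question asks for, here is a small sketch building on the example above (the cutoff of 2 is arbitrary):

# keep only the 2 most common words, dropping the rest
C_trimmed = Counter(dict(C.most_common(2)))
print(C_trimmed)
# Counter({'was': 2, 'It': 1})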

Upvotes: 0
