Kublakhan

Reputation: 11

How can I find words that occur frequently across several different texts?

So, I'm trying to find words that crop up in a collection of several texts. They don't necessarily have to be very frequent in any given text, or even across all the texts; all that matters is that their frequency is roughly the same in every text in the sample.

This seems fairly simple, but I haven't been able to find a clean and elegant way to do it. The only idea that comes to mind is computing the frequency of every word in each text (using something like this, say), turning those lists into dictionaries, and then keeping every key whose range of values across all the dictionaries is fairly low (say, the lowest value is within 25% of the highest). That seems like it'd work, but it feels like such a kludge, and this problem seems common and banal enough that I thought I'd ask if there's a better solution out there.
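For reference, the range-based idea described above can be sketched roughly like this; the function names, the lowercasing, and the 25% default are my own choices, not anything prescribed:

```python
from collections import Counter
import re


def relative_frequencies(text):
    """Map each word to its share of the text's total word count."""
    words = re.findall(r"\w+", text.lower())
    total = len(words)
    return {word: count / total for word, count in Counter(words).items()}


def stable_words(texts, tolerance=0.25):
    """Words appearing in every text whose frequency range is small.

    A word qualifies if its lowest per-text frequency is within
    `tolerance` of its highest, i.e. min >= (1 - tolerance) * max.
    Returns a dict mapping each qualifying word to its per-text
    frequency list.
    """
    freqs = [relative_frequencies(t) for t in texts]
    common = set.intersection(*(set(f) for f in freqs))
    result = {}
    for word in common:
        values = [f[word] for f in freqs]
        if min(values) >= (1 - tolerance) * max(values):
            result[word] = values
    return result
```

Note this only considers words present in all texts; a word missing from one text would otherwise have a frequency of 0 there and fail any range check anyway.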

Upvotes: 1

Views: 61

Answers (1)

d-k-bo

Reputation: 672

I think there is no better way than calculating the frequency of each word in each text

from collections import Counter
import re

word_frequencies = []  # one dict of relative frequencies per text
for text in list_of_text_strings:
    word_list = re.findall(r'\w+', text)
    total_words = len(word_list)
    word_frequencies.append({
        word: count / total_words
        for word, count in Counter(word_list).items()
    })

and then creating a set of all words.

all_words = {word for text in word_frequencies for word in text}

To compare the frequencies, it's probably best to calculate the standard deviation of each word's frequency across the texts and build a dict of word/std pairs, sorted in ascending order.

import math


def std(xs):
    # Population standard deviation; compute the mean once up front.
    mean = sum(xs) / len(xs)
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))


word_std_deviations = dict(sorted(
    (
        (word, std([text.get(word, 0) for text in word_frequencies]))
        for word in all_words
    ),
    key=lambda x: x[1]
))
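Since the dict is sorted by rising standard deviation, the words distributed most evenly across the texts come first. A small usage sketch (the example data and the 0.01 cutoff here are made up for illustration):

```python
from itertools import islice

# Made-up data in the same shape as word_std_deviations above:
# word -> standard deviation of its per-text frequencies, ascending.
word_std_deviations = {"the": 0.001, "a": 0.004, "cat": 0.02, "quark": 0.09}

# Most uniform words come first, so the N "most shared" words
# are simply the first N keys (dicts preserve insertion order).
top_words = list(islice(word_std_deviations, 2))

# Alternatively, keep everything under an arbitrary deviation cutoff.
stable = [w for w, s in word_std_deviations.items() if s < 0.01]
```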

Upvotes: 1
