BitByBit

Reputation: 567

How to find if a frequent word is concentrated in a specific part of text or evenly distributed?

I need to find out whether, for example, the word stackoverflow is concentrated in a specific part of a text/string or more or less evenly distributed throughout it.

concentratedString = 'one stackoverflow two stackoverflow three four five stackoverflow six seven eight nine ten eleven twelve thirteen fourteen fifteen sixteen seventeen eighteen nineteen stackoverflow twenety twentyone twentytwo tentythree twentyfour twentyfive ...'

evenlydisString = 'one two three four five six stackoverflow seven eight nine ten eleven twelve stackoverflow thirteen fourteen fifteen sixteen stackoverflow seventeen eighteen nineteen twenety twentyone twentytwo tentythree twentyfour twentyfive stackoverflow...'

As can be seen, concentratedString has the word in question mostly in the first quarter, whereas evenlydisString has the word evenly distributed.

Is there a tool, preferably in Python, that does this? If not, how would you go about it? My idea: 1) loop through the string, 2) note the positions, 3) calculate the distances between positions.

Upvotes: 0

Views: 50

Answers (1)

Samwise

Reputation: 71517

It seems like you'd want to start by getting the positions of the matching word in the text, and then apply a function to that list of positions whose value depends on how clustered or dispersed they are. Maybe use stdev as a starting point?

>>> from statistics import stdev
>>> def word_concentration(word: str, text: str) -> float:
...     return stdev(i for i, w in enumerate(text.split()) if w == word)
...
>>> word_concentration("stackoverflow", concentratedString)
9.5
>>> word_concentration("stackoverflow", evenlydisString)
6.027713773341708
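
As a complement, step 3 from the question (distances between positions) could be sketched as below. This is only one possible measure, not an established tool: the function names `word_positions` and `gap_variation` are made up for illustration, and using the coefficient of variation of the gaps is an assumption about what "evenly distributed" should mean.

```python
from statistics import mean, pstdev


def word_positions(word: str, text: str) -> list[int]:
    """Token indices at which `word` occurs (exact match on whitespace-split tokens)."""
    return [i for i, w in enumerate(text.split()) if w == word]


def gap_variation(word: str, text: str) -> float:
    """Coefficient of variation of the gaps between consecutive occurrences.

    0.0 means perfectly regular spacing; larger values mean the gaps differ
    more, which suggests the word is bunched up somewhere in the text.
    """
    positions = word_positions(word, text)
    # Distance between each occurrence and the next one.
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if len(gaps) < 2:
        raise ValueError("need at least three occurrences to compare gaps")
    return pstdev(gaps) / mean(gaps)
```

On the two strings from the question this should give roughly 0.82 for concentratedString and 0.17 for evenlydisString, so lower means more even. Note that, as with the stdev version, a token with trailing punctuation such as "stackoverflow..." is not counted, because the comparison is an exact match on whitespace-split tokens.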

Upvotes: 1
