Reputation: 73
I have a bunch of texts, and I need to find the probability of each letter in them, excluding punctuation and numbers. I counted the occurrences of each letter and put them into a dictionary:
import re
def tokenize(string):
    return re.compile(r'\w+').findall(string)
from collections import Counter
def word_freq(string):
    text = tokenize(string.lower())
    str1 = ''.join(str(e) for e in text)
    return dict(Counter(str1))
But I want the probability of each letter, not the count. Can anyone tell me how to update only the values in the dictionary?
Upvotes: 1
Views: 3041
Reputation: 51653
Starting from your code, you can count words or letters and return both as dicts. Then sum the counts to get the total, divide each count by that total, and print the result:
import re
def tokenize(string):
    return re.compile(r'\w+').findall(string)
from collections import Counter
def word_freq(string):
    text = tokenize(string.lower())
    c = Counter(text) # count the words
    d = Counter(''.join(text)) # count all letters
    return (dict(c),dict(d)) # return a tuple of counted words and letters
data = "This is a text, it contains dupes and more dupes and dupes of dupes and lkkk."
words, letters = word_freq(data) # count and get dicts with counts
sumWords = sum(words.values()) # sum total words
sumLetters = sum(letters.values()) # sum total letters
# calc / print probability of word
for w in words:
    print("Probability of '{}': {}".format(w,words[w]/sumWords))
# calc / print probability of letter
for l in letters:
    print("Probability of '{}': {}".format(l,letters[l]/sumLetters))
To modify your dict so it holds the probabilities, simply replace each value with the calculated probability:
# update the counts to probabilities:
for w in words:
    words[w] = words[w]/sumWords
print(words)
Output:
# words
Probability of 'this': 0.0625
Probability of 'is': 0.0625
Probability of 'a': 0.0625
Probability of 'text': 0.0625
Probability of 'it': 0.0625
Probability of 'contains': 0.0625
Probability of 'dupes': 0.25
Probability of 'and': 0.1875
Probability of 'more': 0.0625
Probability of 'of': 0.0625
Probability of 'lkkk': 0.0625
# letters
Probability of 't': 0.08333333333333333
Probability of 'h': 0.016666666666666666
Probability of 'i': 0.06666666666666667
# [....] snipped some for brevity
Probability of 'f': 0.016666666666666666
Probability of 'l': 0.016666666666666666
Probability of 'k': 0.05
After recomputing the values of words:
{'this': 0.0625, 'is': 0.0625, 'a': 0.0625, 'text': 0.0625, 'it': 0.0625,
'contains': 0.0625, 'dupes': 0.25, 'and': 0.1875, 'more': 0.0625,
'of': 0.0625, 'lkkk': 0.0625}
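Since the question asks for letters only, note that `\w` also matches digits and the underscore, so those would be counted too if they appear in the text. A minimal sketch of one way around that, assuming it is acceptable to filter with `str.isalpha()` instead of a regex (the function name `letter_probabilities` is made up for this example):

```python
from collections import Counter

def letter_probabilities(text):
    # Keep alphabetic characters only; str.isalpha() rejects digits,
    # underscores, punctuation and whitespace.
    letters = [ch for ch in text.lower() if ch.isalpha()]
    counts = Counter(letters)
    total = sum(counts.values())
    if total == 0:
        # No letters at all: return an empty dict instead of dividing by zero.
        return {}
    # Build a new dict mapping each letter to count / total.
    return {letter: n / total for letter, n in counts.items()}

print(letter_probabilities("abba 42_!"))  # {'a': 0.5, 'b': 0.5}
```

This builds a new dict rather than updating the counts in place, which keeps the original counts around if you still need them.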
Upvotes: 1