Henry Crow

Reputation: 23

Word frequencies in a text file in Python

I want to find the frequencies of the words listed in wanted. The code does find those frequencies, but the displayed result also contains a lot of unnecessary data.

Code:

from collections import Counter
import re
wanted = "whereby also thus"
cnt = Counter()
words = re.findall('\w+', open('C:/Users/user/desktop/text.txt').read().lower())
for word in words:
    if word in wanted:
        cnt[word] += 1
print (cnt)

Results:

Counter({'e': 131, 'a': 119, 'by': 38, 'where': 16, 's': 14, 'also': 13, 'he': 4, 'whereby': 2, 'al': 2, 'b': 2, 'o': 1, 't': 1})

Questions:

  1. How do I omit all those 'e', 'a', 'by', 'where', etc.?
  2. If I then wanted to sum up the frequencies of the words (also, thus, whereby) and divide the sum by the total number of words in the text, would that be possible?

Disclaimer: this is not a school assignment. I just have a lot of free time at work right now, and since I spend a lot of time reading texts, I decided to do this little project to remind myself of what I was taught a couple of years ago.

Thanks in advance for any help.

Upvotes: 1

Views: 675

Answers (2)

PythonProgrammi

Reputation: 23463

Reading from the web

I made this small modification of Axel's code so that it reads a text file from the web (Alice in Wonderland), since I don't have your text file and wanted to try the code. I am posting it here in case someone needs something like this.

from collections import Counter
import re
from urllib.request import urlopen
testo = str(urlopen("https://www.gutenberg.org/files/11/11.txt").read())
wanted = ["whereby", "also", "thus", "Alice", "down", "up", "cup"]
cnt = Counter()
words = re.findall(r'\w+', testo)
for word in words:
    if word in wanted:
        cnt[word] += 1
print(cnt)

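# Sum of the counts of all matched words
total_cnt = sum(cnt.values())

# Average count per distinct matched word (not a share of the whole text)
print(total_cnt / len(cnt))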

output

Counter({'Alice': 334, 'up': 97, 'down': 90, 'also': 4, 'cup': 2})
105.4
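
For the second question (dividing by the total number of words in the text), here is a minimal sketch. It assumes the page is UTF-8 text. Calling str() on the bytes returned by read() keeps the b'...' wrapper and the escaped line breaks, so this version decodes the bytes instead. The names fetch_text and relative_frequency and the sample sentence are made up for this illustration.

```python
from collections import Counter
import re
from urllib.request import urlopen


def fetch_text(url):
    # Hypothetical helper: download a page and decode it (UTF-8 is an assumption).
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def relative_frequency(text, wanted):
    # Count only whole words that appear in `wanted`.
    words = re.findall(r"\w+", text)
    cnt = Counter(word for word in words if word in wanted)
    # Share of all words in the text taken up by the wanted words.
    share = sum(cnt.values()) / len(words) if words else 0.0
    return cnt, share


# Invented sample text, so the sketch runs without network access.
sample = "Alice fell down, down, down. Would Alice also get up?"
cnt, share = relative_frequency(sample, {"Alice", "down", "up", "also"})
print(cnt)
print(share)
```

With the real book, the call would presumably be relative_frequency(fetch_text("https://www.gutenberg.org/files/11/11.txt"), wanted).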

How many times the same word is found in adjacent sentences

This answers the request (from the author of the question) to find how many times a word is found in adjacent sentences. If a sentence contains the same word several times (e.g. 'had') and the next sentence contains it too, I count that as one repetition. That is why I use the wordfound list.

from collections import Counter
import re


testo = """There was nothing so VERY remarkable in that; nor did Alice think it so? Thanks VERY much. Out of the way to hear the Rabbit say to itself, 'Oh dear! Oh dear! I shall be late!' (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed. Quite natural); but when the Rabbit actually TOOK A WATCH OUT OF ITS? WAISTCOAT-POCKET, and looked at it, and then hurried on.
Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit. with either a waistcoat-pocket, or a watch to take out of it! and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop? Down a large rabbit-hole under the hedge.
Alice opened the door and found that it led into a small passage, not much larger than a rat-hole: she knelt down and looked along the passage into the loveliest garden you ever saw. How she longed to get out of that dark hall, and wander about among those beds of bright flowers and those cool fountains, but she could not even get her head through the doorway; 'and even if my head would go through,' thought poor Alice, 'it would be of very little use without my shoulders. Oh, how I wish I could shut up like a telescope! I think I could, if I only knew how to begin.'For, you see, so many out-of-the-way things had happened lately, that Alice had begun to think that very few things indeed were really impossible. There seemed to be no use in waiting by the little door, so she went back to the table, half hoping she might find another key on it, or at any rate a book of rules for shutting people up like telescopes: this time she found a little bottle on it, ('which certainly was not here before,' said Alice,) and round the neck of the bottle was a paper label, with the words 'DRINK ME' beautifully printed on it in large letters. It was all very well to say 'Drink me,' but the wise little Alice was not going to do THAT in a hurry. 'No, I'll look first,' she said, 'and see whether it's marked "poison" or not'; for she had read several nice little histories about children who had got burnt, and eaten up by wild beasts and other unpleasant things, all because they WOULD not remember the simple rules their friends had taught them: such as, that a red-hot poker will burn you if you hold it too long; and that if you cut your finger VERY deeply with a knife, it usually bleeds; and she had never forgotten that, if you drink much from a bottle marked 'poison,' it is almost certain to disagree with you, sooner or later. 
However, this bottle was NOT marked 'poison,' so Alice ventured to taste it, and finding it very nice, (it had, in fact, a sort of mixed flavour of cherry-tart, custard, pine-apple, roast turkey, toffee, and hot buttered toast,) she very soon finished it off. """


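# Split the text into sentences: a capital letter up to the next '.', '!' or '?'
frasi = re.findall(r"[A-Z].*?[.!?]", testo, re.MULTILINE | re.DOTALL)

print("How many times this words are repeated in adjacent sentences:")
cnt2 = Counter()
for n, s in enumerate(frasi):
    words = re.findall(r"\w+", s)
    wordfound = []  # words of this sentence already found in the next one
    for word in words:
        try:
            # Note: this is a substring test against the next sentence's text
            if word in frasi[n + 1]:
                wordfound.append(word)
                if wordfound.count(word) < 2:
                    cnt2[word] += 1
        except IndexError:
            # The last sentence has no following sentence
            pass
for k, v in cnt2.items():
    print(k, v)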

output

How many times this words are repeated in adjacent sentences:
had 1
hole 1
or 1
as 1
little 2
that 1
hot 1
large 1
it 5
to 5
a 6
not 3
and 2
s 1
me 1
bottle 1
is 1
no 1
the 6
how 1
Oh 1
she 2
at 1
marked 1
think 1
VERY 1
I 2
door 1
red 1
of 1
dear 1
see 1
could 2
in 2
so 1
was 1
poison 1
A 1
Alice 3
all 1
nice 1
rabbit 1
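
Note that word in frasi[n + 1] tests against the text of the next sentence, so short words such as 'a' or 's' also match inside longer words. Below is a minimal sketch of a whole-word variant that compares sets of words instead. The function name adjacent_repeats and the sample text are invented for illustration, and the sentence regex is the same simple one used above.

```python
from collections import Counter
import re


def adjacent_repeats(text):
    # Same simple sentence split as above: capital letter up to '.', '!' or '?'.
    sentences = re.findall(r"[A-Z].*?[.!?]", text, re.DOTALL)
    # One set of whole words per sentence, so repeats inside a sentence collapse.
    word_sets = [set(re.findall(r"\w+", s)) for s in sentences]
    cnt = Counter()
    # Pair each sentence with the one that follows it.
    for current, following in zip(word_sets, word_sets[1:]):
        # Words shared by both sentences count once per pair.
        cnt.update(current & following)
    return cnt


sample = "The cat sat. The cat ran away. A dog ran."
print(adjacent_repeats(sample))
```

Using sets removes the need for the wordfound list, and zip stops at the last pair, so no IndexError handling is required. The comparison stays case-sensitive, as in the code above.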

Upvotes: 1

Axel Persinger

Reputation: 331

As others have pointed out, you need to change your string wanted to a list. I just hardcoded a list, but you could use str.split(" ") if you were passed a string in a function. I also implemented the frequency calculation for you. As a note, make sure you close your files; it's easier (and recommended) to open them in a with statement, which closes the file for you.

from collections import Counter
import re

# A list makes "in" compare whole words instead of substrings
wanted = ["whereby", "also", "thus"]
cnt = Counter()
# The with statement closes the file automatically
with open('C:/Users/user/desktop/text.txt', 'r') as fp:
    fp_contents = fp.read().lower()
words = re.findall(r'\w+', fp_contents)
for word in words:
    if word in wanted:
        cnt[word] += 1
print(cnt)

total_cnt = sum(cnt.values())

# Share of all words in the text that are wanted words
print(total_cnt / len(words))
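
To see why the code in the question produced keys like 'e' and 'by': when wanted is a string, in is a substring test, while with a list it compares whole words. A small sketch with an invented sample sentence (the variable names are only for illustration):

```python
from collections import Counter
import re

wanted_string = "whereby also thus"
wanted_list = wanted_string.split()  # ['whereby', 'also', 'thus']

# Invented sample sentence, lowercased as in the question's code.
text = "He also stood by, whereby he was seen."
words = re.findall(r"\w+", text.lower())

# Substring test: 'he' and 'by' are found inside "whereby".
as_substrings = Counter(w for w in words if w in wanted_string)
# Whole-word test: only exact matches are kept.
as_words = Counter(w for w in words if w in wanted_list)

print(as_substrings)
print(as_words)
```

Calling split() without arguments splits on any run of whitespace, which is a little more forgiving than split(" ") if the string contains double spaces.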

Upvotes: 1
