Count most common titular words in a paragraph of text

Question

I have a task where I open a text file, count how many times each capitalised word occurs, and then print the top 3 occurrences. My code works until it gets a text file in which a capitalised word appears more than once on a line.

txt file 1:

Jellicle Cats are black and white,
Jellicle Cats are rather small;
Jellicle Cats are merry and bright,
And pleasant to hear when they caterwaul.
Jellicle Cats have cheerful faces,
Jellicle Cats have bright black eyes;
They like to practise their airs and graces
And wait for the Jellicle Moon to rise.

Results:

6 Jellicle
5 Cats
2 And

txt file 2:

Baa Baa black sheep have you any wool?
Yes sir Yes sir, wool for everyone.
One for the master, 
One for the dame.
One for the little boy who lives down the lane.

Results:

1 Baa
1 One
1 Yes
1 Baa
1 One
1 Yes
1 Baa
1 One
1 Yes

Here is my code:

wc = {}
t3 = {}
p = 0
xx=0
a = open('novel.txt').readlines()
for i in a:
  b = i.split()
  for l in b:
    if l[0].isupper():
      if l not in wc:
         wc[l] = 1
      else:
        wc[l] += 1
while p < 3:
  p += 1
  max_val=max(wc.values())
  for words in wc:
    if wc[words] == max_val:
      t3[words] = wc[words]
      wc[words] = 1

    else:
      null = 1
while xx < 3:
  xx+=1
  maxval = max(t3.values())
  for word in sorted(t3):
    if t3[word] == maxval:
      print(t3[word],word)
      t3[word] = 1
    else:
      null+=1

Please help me solve this. Thank You!

Thank you for all the suggestions. After manually debugging the code and reading your responses, I figured out that the while xx < 3: loop was unnecessary, and that wc[words] = 1 made the program double count words when the third most common word occurred only once. Replacing it with wc[words] = 0 avoided the repeated counting.
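For comparison, here is a minimal sketch of the same dictionary-counting idea that sorts the counts once instead of repeatedly searching for the maximum. It is not the exact code I ended up with; the function name top_capitalised and the alphabetical tie-break are assumptions made for illustration, and punctuation is left attached to words, as in the original code.

```python
def top_capitalised(text, n=3):
    """Return the n most frequent capitalised words as (word, count) pairs."""
    counts = {}
    for word in text.split():
        # split() never yields empty strings, so word[0] is always safe
        if word[0].isupper():
            counts[word] = counts.get(word, 0) + 1
    # Highest count first; ties broken alphabetically (an assumed rule)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


rhyme = (
    "Baa Baa black sheep have you any wool?\n"
    "Yes sir Yes sir, wool for everyone.\n"
    "One for the master, \n"
    "One for the dame.\n"
    "One for the little boy who lives down the lane.\n"
)

for word, count in top_capitalised(rhyme):
    print(count, word)
# prints:
# 3 One
# 2 Baa
# 2 Yes
```

Because the counts are never modified after counting, there is no need to reset entries to 0 or 1.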

Thank you!

cs95 · Accepted Answer

This is simple, but you'll need a few tools:

re.sub, to get rid of punctuation
filter, to filter out words by title case using str.istitle
collections.Counter, to count words (do from collections import Counter first).
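As a rough illustration of what each tool does on its own, here is a small sketch using one line from the second rhyme. The variable names are only for demonstration. Note that str.istitle is stricter than checking the first character: an all-caps word such as "USA" is not title case.

```python
import re
from collections import Counter

line = "Yes sir Yes sir, wool for everyone."

# re.sub: drop every character that is neither a word character nor whitespace
cleaned = re.sub(r"[^\w\s]", "", line)
# cleaned == "Yes sir Yes sir wool for everyone"

# str.istitle: True only for title-cased words
title_checks = {w: w.istitle() for w in ["Baa", "black", "USA"]}
# title_checks == {"Baa": True, "black": False, "USA": False}

# filter + Counter: keep the title-cased words and count them
counts = Counter(filter(str.istitle, cleaned.split()))
# counts["Yes"] == 2
```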

Assuming text holds your first paragraph, this works:

In [296]: Counter(filter(str.istitle, re.sub(r'[^\w\s]', '', text).split())).most_common(3)
Out[296]: [('Jellicle', 6), ('Cats', 5), ('And', 2)]

Counter.most_common(x) returns the x most common words.

Incidentally, this is the output for your second paragraph:

[('One', 3), ('Baa', 2), ('Yes', 2)]
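If you want this as a script rather than a one-liner, something along these lines should work. It is only a sketch: the function names top_title_words and top_title_words_in_file are made up here, and the file name novel.txt is taken from your code.

```python
import re
from collections import Counter


def top_title_words(text, n=3):
    """Return the n most common title-cased words as (word, count) pairs."""
    # Strip punctuation, then split on whitespace
    words = re.sub(r"[^\w\s]", "", text).split()
    # Keep only title-cased words and count them
    return Counter(filter(str.istitle, words)).most_common(n)


def top_title_words_in_file(path, n=3):
    """Hypothetical helper: read the whole file and count its title-cased words."""
    with open(path) as f:
        return top_title_words(f.read(), n)
```

To print in the same format as your expected results, loop over the pairs, for example: for word, count in top_title_words_in_file('novel.txt'): print(count, word).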
