user2626758

Reputation: 117

Return most common words in a website, such that word count >5

I am new to Python. I have a simple program that counts how many times each word is used on a web page.

import re
import urllib2
from collections import Counter

from bs4 import BeautifulSoup  # assumes BeautifulSoup 4 is installed

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

url = 'https://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart'
ourUrl = opener.open(url).read()
soup = BeautifulSoup(ourUrl)
dem = soup.findAll('p')  # find paragraphs
word_counts = Counter()
stopwords = frozenset(('A', 'AN', 'THE'))

for i in dem:  # loop over each paragraph
    words = re.findall(r'\w+', i.text)
    # upper-case every word and drop the stopwords
    cap_words = [word.upper() for word in words if word.upper() not in stopwords]
    word_counts.update(cap_words)

print word_counts

The problem is that this script returns a lot of words that are used only once. How can I update the script so that it only includes words with a count of at least 5?

Also, how can I arrange the top 5 most common words into, say, word1, word2, word3, etc.?

Upvotes: 0

Views: 973

Answers (2)

Alex Woolford

Reputation: 4563

Try: print word_counts.most_common(5)

Upvotes: 0

Vyassa Baratham

Reputation: 1467

How can I update the script so that it only includes words with a count of at least 5?

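You can filter the Counter as follows: filter(lambda x: x[1] > 5, word_counts.iteritems()). This keeps words that occur more than 5 times; use >= 5 to also keep words that occur exactly 5 times.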

filter() takes a function and an iterable, calls the function on each element of the iterable, and keeps only the elements for which the function returns a true value. iteritems() returns an iterator that yields (key, value) pairs from the dictionary, so x[1] is the word's count.
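
As a minimal sketch of that kind of filtering, written so it runs on both Python 2.7 and Python 3: the sample text and the min_count name are made up for illustration, and items() stands in for iteritems(), which does not exist in Python 3.

```python
import re
from collections import Counter

# Hypothetical sample text standing in for the scraped paragraphs.
text = "la la la la la la do do do do do re re mi"

word_counts = Counter(w.upper() for w in re.findall(r'\w+', text))

min_count = 5  # illustrative threshold; the question asks for "at least 5"

# Keep only the (word, count) pairs whose count reaches the threshold.
frequent = {word: count for word, count in word_counts.items()
            if count >= min_count}

print(sorted(frequent.items()))
```

With > in place of >=, only words that occur more than 5 times would remain (in this sample, just LA).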

How can I arrange the top 5 most common words into, say, word1, word2, word3, etc.?

Counter has a most_common(n) method, which returns a list of the n most common elements and their counts, ordered from most to least common. See http://docs.python.org/2/library/collections.html
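
Here is a rough sketch of how most_common could feed the word1, word2, ... labelling from the question. The counts below are invented sample data, and the word1-style keys are just one possible way to name the results.

```python
from collections import Counter

# Invented counts standing in for word_counts from the script.
word_counts = Counter({'MOZART': 40, 'HIS': 31, 'IN': 25,
                       'OF': 22, 'AND': 19, 'VIENNA': 7})

# most_common(5) returns a list of (word, count) tuples, highest count first.
top_five = word_counts.most_common(5)

# Label the words word1..word5 in rank order.
ranked = {'word%d' % rank: word
          for rank, (word, count) in enumerate(top_five, start=1)}

for rank in range(1, len(top_five) + 1):
    key = 'word%d' % rank
    print('%s = %s' % (key, ranked[key]))
```

A plain list of the words, [word for word, count in top_five], may be simpler if the numbered names are not really needed.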

Upvotes: 2
