Find the most common words in a website

Question

I am new to Python. I have a simple program that counts how many times each word is used on a web page.

import re
import urllib2
from collections import Counter

from bs4 import BeautifulSoup  # assumes BeautifulSoup 4 is installed

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

url = 'http://en.wikipedia.org/wiki/Albert_Einstein'
ourUrl = opener.open(url).read()
soup = BeautifulSoup(ourUrl)
dem = soup.findAll('p')  # find paragraphs
for i in dem:  # loop over each paragraph
    words = re.findall(r'\w+', i.text)
    cap_words = [word.upper() for word in words]
    word_counts = Counter(cap_words)  # a new Counter for every paragraph
    print word_counts

The problem is that this gives me a word count paragraph by paragraph instead of a total word count for the whole page. What change is required? Also, if I want to filter out common articles such as "a", "an", and "the", what code do I need to add?
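
One possible way to get a single total is to create the Counter once, before the loop, and update it with each paragraph's words, skipping any word found in a stop-word set. The sketch below is only an illustration under a few assumptions: the names `count_words`, `STOP_WORDS`, and `sample` are hypothetical, the stop-word set holds just the three articles mentioned, and `sample` stands in for the paragraph texts so the snippet runs without a network connection. It uses only the standard library and avoids the print statement, so it should run on Python 2.7 as well as Python 3.

```python
import re
from collections import Counter

# Assumed stop-word list for illustration; extend it as needed.
# Upper-case, because the words are upper-cased before counting.
STOP_WORDS = {'A', 'AN', 'THE'}


def count_words(paragraphs, stop_words=STOP_WORDS):
    """Return one Counter covering every paragraph, skipping stop words."""
    total = Counter()  # created once, outside the loop
    for text in paragraphs:
        words = re.findall(r'\w+', text)
        # Upper-case each word as in the question, then drop stop words.
        total.update(w.upper() for w in words if w.upper() not in stop_words)
    return total


# Stand-in for the paragraph texts, e.g. [p.text for p in soup.findAll('p')].
sample = ['The cat sat on a mat.', 'A cat and the dog.']
word_counts = count_words(sample)
print(word_counts.most_common(3))
```

With the code from the question, passing `[i.text for i in dem]` instead of `sample` would give the counts for the whole page, and `most_common(n)` returns the `n` most frequent words.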

Answers (1)
