user2626758

Reputation: 117

Find the most common words in a website

I am new to Python. I have a simple program that counts how many times each word is used on a web page.

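import re
import urllib2
from collections import Counter
from bs4 import BeautifulSoup  # assumes BeautifulSoup 4; version 3 uses "from BeautifulSoup import BeautifulSoup"

opener = urllib2.build_opener()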
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

url = 'http://en.wikipedia.org/wiki/Albert_Einstein'
ourUrl = opener.open(url).read()
soup = BeautifulSoup(ourUrl)
dem = soup.findAll('p') #find paragraphs
for i in dem:    # loop for each para

    words = re.findall(r'\w+', i.text)   
    cap_words = [word.upper() for word in words]
    word_counts = Counter(cap_words)
    print word_counts

The problem is that this gives me the word count paragraph by paragraph instead of the total word count for the whole page. What change is required? Also, if I want to filter out common articles such as "a", "an", and "the", what code do I need to add?

Upvotes: 2

Views: 2098

Answers (1)

Peter DeGlopper

Reputation: 37319

Assuming you really want to find only words contained in paragraphs, and are happy with your regexp, this is the minimal change to get the total word count of the retrieved document:

soup = BeautifulSoup(ourUrl)
dem = soup.findAll('p') #find paragraphs
word_counts = Counter()
for i in dem:    # loop for each para
    words = re.findall(r'\w+', i.text)
    cap_words = [word.upper() for word in words]
    word_counts.update(cap_words)

print word_counts
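
Since the question is about the most common words, here is a minimal, self-contained sketch of the same accumulation pattern that can be run without fetching the page. It is written for Python 3 (an assumption; the code above is Python 2), and the `paragraphs` list is a hypothetical stand-in for the text of each `<p>` element:

```python
import re
from collections import Counter

# Hypothetical stand-ins for the text of each <p> element (i.text above).
paragraphs = [
    "Albert Einstein was a physicist.",
    "Einstein developed the theory of relativity.",
]

# One Counter shared by all paragraphs, updated inside the loop.
word_counts = Counter()
for text in paragraphs:
    words = re.findall(r"\w+", text)
    word_counts.update(word.upper() for word in words)

# most_common(n) returns the n highest counts as (word, count) pairs.
top = word_counts.most_common(1)
print(top)  # [('EINSTEIN', 2)]
```

The key point is that the `Counter` is created once, before the loop, so each call to `update` adds to the running totals instead of starting over.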

To ignore common words, one method would be to define a frozenset of ignorable words:

word_counts = Counter()
stopwords = frozenset(('A', 'AN', 'THE'))
for i in dem:    # loop for each para
    words = re.findall(r'\w+', i.text)
    cap_words = [word.upper() for word in words if word.upper() not in stopwords]
    word_counts.update(cap_words)
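
The same filtering idea can be wrapped in a small function so it is easy to try on sample text. This is only a sketch for Python 3 (an assumption; the code above is Python 2), and `count_words`, `STOPWORDS`, and the `sample` strings are hypothetical names and data chosen for illustration:

```python
import re
from collections import Counter

# Upper-cased so it matches the normalized words, as in the answer above.
STOPWORDS = frozenset(("A", "AN", "THE"))


def count_words(paragraphs, stopwords=STOPWORDS):
    """Count upper-cased words across all paragraphs, skipping stopwords."""
    counts = Counter()
    for text in paragraphs:
        upper_words = (w.upper() for w in re.findall(r"\w+", text))
        counts.update(w for w in upper_words if w not in stopwords)
    return counts


sample = ["The cat saw a dog.", "An owl saw the cat."]
counts = count_words(sample)
print(counts.most_common(2))
```

Upper-casing each word once before the membership test avoids calling `upper()` twice per word, and a `frozenset` keeps each lookup fast even if the stopword list grows.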

Upvotes: 1
