Reputation: 117
I am new to Python. I have a simple program that counts the number of times each word is used on a web page.
import re
import urllib2
from collections import Counter
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 is installed

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
url = 'http://en.wikipedia.org/wiki/Albert_Einstein'
ourUrl = opener.open(url).read()
soup = BeautifulSoup(ourUrl)
dem = soup.findAll('p')  # find paragraphs
for i in dem:  # loop over each paragraph
    words = re.findall(r'\w+', i.text)
    cap_words = [word.upper() for word in words]
    word_counts = Counter(cap_words)
    print word_counts
The problem is that this gives me the word count paragraph by paragraph instead of the total word count for the page. What change is required? Also, if I want to filter out common articles like "a", "an", and "the", what code do I need to include?
Upvotes: 2
Views: 2098
Reputation: 37319
Assuming you really want to find only words contained in paragraphs, and are happy with your regexp, this is the minimal change to get the total word count of the retrieved document:
soup = BeautifulSoup(ourUrl)
dem = soup.findAll('p')  # find paragraphs
word_counts = Counter()  # one counter shared by all paragraphs
for i in dem:  # loop over each paragraph
    words = re.findall(r'\w+', i.text)
    cap_words = [word.upper() for word in words]
    word_counts.update(cap_words)  # add to the running totals
print word_counts
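To try the accumulation pattern without fetching a page, here is a minimal sketch written for Python 3. The `paragraphs` list is a made-up stand-in for the `i.text` values above; the point is only that `Counter.update` adds to the existing counts rather than replacing them.

```python
import re
from collections import Counter

# Hypothetical stand-in for the text of each <p> element.
paragraphs = [
    "Einstein was born in Ulm.",
    "In 1905 Einstein published four papers.",
]

word_counts = Counter()
for text in paragraphs:
    words = re.findall(r'\w+', text)
    # update() adds these occurrences to the running totals.
    word_counts.update(word.upper() for word in words)

print(word_counts["EINSTEIN"])  # 2
print(word_counts["IN"])        # 2
```

Because the counter is created once, before the loop, words that appear in different paragraphs end up in the same total.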
To ignore common words, one method would be to define a frozenset of ignorable words:
word_counts = Counter()
stopwords = frozenset(('A', 'AN', 'THE'))
for i in dem:  # loop over each paragraph
    words = re.findall(r'\w+', i.text)
    # keep only the upper-cased words that are not stopwords
    cap_words = [word.upper() for word in words if word.upper() not in stopwords]
    word_counts.update(cap_words)
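The same filtering can be wrapped in a small function, sketched here for Python 3. The name `count_words` and the `sample` sentences are illustrative assumptions, not part of the original program; the sketch also upper-cases each word only once instead of twice.

```python
import re
from collections import Counter

STOPWORDS = frozenset(('A', 'AN', 'THE'))

def count_words(texts, stopwords=STOPWORDS):
    """Count upper-cased words across all texts, skipping stopwords."""
    counts = Counter()
    for text in texts:
        # Upper-case each word once, then filter against the stopword set.
        upper_words = (w.upper() for w in re.findall(r'\w+', text))
        counts.update(w for w in upper_words if w not in stopwords)
    return counts

# Hypothetical sample input standing in for paragraph text.
sample = ["The cat saw a dog.", "An owl saw the cat."]
counts = count_words(sample)
print(counts["CAT"])      # 2
print("THE" in counts)    # False
print(counts.most_common(2))
```

`most_common(n)` returns the `n` most frequent words with their counts, which is usually more readable than printing the whole counter for a long page.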
Upvotes: 1