Reputation: 61
I tried to sum the values that I scraped from the HTML, but the sum looks wrong (it is obviously lower than the actual value).
Looking at other people's code, I noticed that they use re.findall()
to find the numbers in the HTML.
My question is: why can't I read the content of each element directly from the HTML? My code is the first block below; the second block is the part of other people's code that differs from mine.
Thank you in advance for your answers!
# load in the required packages for reading HTML
from urllib.request import urlopen
from bs4 import BeautifulSoup #parser for HTML
import ssl
import re
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
#open the url
url = 'http://py4e-data.dr-chuck.net/comments_874984.html'
html = urlopen(url, context = ctx).read()
soup = BeautifulSoup(html, "html.parser")
# Retrieve the information from the URL
spans = soup('span')
sum = 0
for span in spans:
    x = span.contents[0]
    for n in x:
        sum = sum + int(n)
print(sum)
sum=0
# Retrieve all of the span tags
tags = soup('span')
for tag in tags:
    # Look at the parts of a tag
    y = str(tag)
    x = re.findall("[0-9]+", y)
    for i in x:
        i = int(i)
        sum = sum + i
print(sum)
Upvotes: 1
Views: 135
Reputation: 24930
If I understand you correctly, this should get you there:
counter = 0
for comment in soup.select('span.comments'):
    counter += int(comment.text)
print(counter)
or even shorter:
comments = [int(comment.text) for comment in soup.select('span.comments')]
print(sum(comments))
Output, in both cases:
2266
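As for why the loop in the question comes out low: as far as I can tell, span.contents[0] is a string, and iterating over a string yields one character at a time, so the inner loop adds the individual digits rather than the whole number. Here is a minimal sketch of the difference. The values in span_texts are made up for illustration and only stand in for the text of each span on the real page.

```python
# Hypothetical span texts, standing in for span.contents[0] on the real page.
span_texts = ["97", "5", "40"]

# What the question's inner loop does: iterating over a string yields
# one character at a time, so each digit is added separately.
digit_total = 0
for text in span_texts:
    for ch in text:
        digit_total += int(ch)

# Converting the whole string at once adds the actual numbers.
number_total = 0
for text in span_texts:
    number_total += int(text)

print(digit_total)   # 9+7 + 5 + 4+0 = 25
print(number_total)  # 97 + 5 + 40 = 142
```

So dropping the inner loop and calling int() on the whole string should be enough. re.findall("[0-9]+", ...) gives the right total for the same reason: it returns each run of digits as a single string.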
Upvotes: 1