Scraping HTML by elements in Python with BeautifulSoup

Question

I tried to sum the values I scraped from the HTML, but the result looks wrong: it is obviously lower than the actual total.

Looking at other people's code, I noticed that they use re.findall() to find the numbers in the HTML.

My question is: why can't I read the numbers directly from the element's contents? My code is the first block below; the second block is the part of other people's code that differs from mine.

Thank you for your answer in advance!

# load in the required packages for reading HTML

from urllib.request import urlopen
from bs4 import BeautifulSoup #parser for HTML
import ssl
import re
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

#open the url
url = 'http://py4e-data.dr-chuck.net/comments_874984.html'
html = urlopen(url, context = ctx).read()
soup = BeautifulSoup(html, "html.parser")

# Retrieve the information from the URL
spans = soup('span')
sum = 0
for span in spans:
    x = span.contents[0]
    for n in x:
        sum = sum + int(n)
print(sum)

sum=0
# Retrieve all of the span tags
tags = soup('span')
for tag in tags:
    # Look at the parts of a tag
    y=str(tag)
    x= re.findall("[0-9]+",y)
    for i in x:
        i=int(i)
        sum=sum+i
print(sum)

Jack Fleeting · Accepted Answer

If I understand you correctly, this should get you there:

counter = 0
for comment in soup.select('span.comments'):
    counter+=int(comment.text)
print(counter)

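or even shorter (this calls the built-in sum(), so run it in a fresh session or rename the sum variable from the question's code first):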

comments = [int(comment.text) for comment in soup.select('span.comments')]
print(sum(comments))

Both versions print the same total.
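
As for why the loop in the question comes out too low: span.contents[0] is a string (BeautifulSoup's NavigableString is a subclass of str), so "for n in x" walks over it one character at a time and adds each digit on its own. re.findall("[0-9]+", ...) returns whole runs of digits, which is why the other code gets the right total. Below is a minimal sketch of the difference. It uses made-up values rather than the real page's data; the list named values and the markup built from it are assumptions for illustration only.

```python
import re

# Hypothetical span texts; the real page's values are not shown in this thread.
values = ["97", "5", "100"]

# What the question's inner loop does: iterating over a string yields
# one character at a time, so each digit is added separately.
digit_total = 0
for text in values:
    for ch in text:
        digit_total += int(ch)

# Converting the whole string at once gives the intended sum.
number_total = sum(int(text) for text in values)

# The regex approach: "[0-9]+" matches each full run of digits in the markup.
markup = [f'<span class="comments">{text}</span>' for text in values]
regex_total = sum(int(m) for tag in markup for m in re.findall("[0-9]+", tag))

print(digit_total)   # 9+7 + 5 + 1+0+0 = 22
print(number_total)  # 97 + 5 + 100 = 202
print(regex_total)   # 202
```

In the question's code, dropping the inner loop and calling int(span.contents[0]) directly should therefore give the same total as the regex version, assuming each span holds a single number.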
