Reputation: 15
Here is my homework:
In this assignment you will write a Python program similar to http://www.py4e.com/code3/urllink2.py. The program will use urllib to read the HTML from the data files below, and parse the data, extracting numbers and compute the sum of the numbers in the file.
We provide two files for this assignment. One is a sample file where we give you the sum for your testing and the other is the actual data you need to process for the assignment.
Sample data: http://py4e-data.dr-chuck.net/comments_42.html (Sum=2553)
Actual data: http://py4e-data.dr-chuck.net/comments_228869.html (Sum ends with 10)
You do not need to save these files to your folder since your program will read the data directly from the URL. Note: Each student will have a distinct data url for the assignment - so only use your own data url for analysis.
I want to fix mine code as it is what I have learned so far. I am getting an error as a name
urlib is not defined
.. if I play with imports than I have a problem with sockets.
import urllib
import re
from bs4 import BeautifulSoup
url = input('Enter - ')
html = urlib.request(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")
sum=0
# Retrieve all of the anchor tags
tags = soup('span')
for tag in tags:
# Look at the parts of a tag
y=str(tag)
x= re.findall("[0-9]+",y)
for i in x:
i=int(i)
sum=sum+i
print(sum)
Upvotes: 1
Views: 24738
Reputation: 31
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
url = input('Enter - ')
html = urlopen(url,).read()
soup = BeautifulSoup(html, "html.parser")
# Retrieve all of the anchor tags
tags = soup('span')
numlist = list()
for tag in tags:
# Look at the parts of a tag
y = str(tag)
num = re.findall('[0-9]+',y)
numlist = numlist + num
sum = 0
for i in numlist:
sum = sum + int(i)
print(sum)
Upvotes: 3
Reputation: 11
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
import re
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")
sum=0
# Retrieve all of the anchor tags
tags = soup('span')
for tag in tags:
# Look at the parts of a tag
y=str(tag)
x= re.findall("[0-9]+",y)
for i in x:
i=int(i)
sum=sum+i
print(sum)
Upvotes: 1
Reputation: 195438
Typo: you're having urlib
, it should be urllib
. The context=ctx
isn't necessary:
import re
import urllib
from bs4 import BeautifulSoup
# url = 'http://py4e-data.dr-chuck.net/comments_42.html'
url = 'http://py4e-data.dr-chuck.net/comments_228869.html'
soup = BeautifulSoup(urllib.request.urlopen(url).read(), 'html.parser')
s = sum(int(td.text) for td in soup.select('td:last-child')[1:])
print(s)
Prints:
2410
EDIT: Running your script:
import urllib.request
import re
from bs4 import BeautifulSoup
html = urllib.request.urlopen('http://py4e-data.dr-chuck.net/comments_228869.html').read()
soup = BeautifulSoup(html, "html.parser")
sum=0
# Retrieve all of the anchor tags
tags = soup('span')
for tag in tags:
# Look at the parts of a tag
y=str(tag)
x= re.findall("[0-9]+",y)
for i in x:
i=int(i)
sum=sum+i
print(sum)
Prints:
2410
Upvotes: 2