Reputation: 3
I have just started learning Python through the Coursera course "Python for Everybody", and I have an assignment where I have to follow links using Beautiful Soup. I have seen this question asked before, but the solution there did not work when I tried it. I managed to write something, but it doesn't actually follow the links; it just stays on the same page. If possible, could anyone also point me to materials that give better insight into this assignment? Thanks.
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter URL - ')
cnt = input("Enter count -")
count = int(cnt)
pn = input("Enter position -")
position = int(pn)-1

while count > 0:
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, "html.parser")
    tags = soup('a')
    lst = list()
    for tag in tags:
        lst.append(tag.get('href', None))
    indxpos = lst[position]
    count = count - 1
    print("Retrieving:", indxpos)
Upvotes: 0
Views: 671
Reputation: 169
Your code never does anything with the list of hyperlinks it collects. It only prints one entry of the "lst" list; it never uses that entry to fetch the next page.
Upvotes: 0
Reputation: 554
You never set url to the new URL.
while count > 0:
    html = urllib.request.urlopen(url, context=ctx).read() # Gets the page at url
    ...
    for tag in tags:
        lst.append(tag.get('href', None)) # Appends all the links to lst
    indxpos = lst[position]
    count = count - 1
    print("Retrieving:", indxpos)
    # indxpos is only printed; url is never updated, so the same page is fetched again
You should probably assign the selected link to url instead of indxpos.
while count > 0:
    html = urllib.request.urlopen(url, context=ctx).read() # Gets the page at url
    ...
    for tag in tags:
        lst.append(tag.get('href', None)) # Appends all the links to lst
    url = lst[position]
    count = count - 1
    print("Retrieving:", url)
This way, the next time the loop runs, it will fetch the new URL.
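To see the loop move from page to page without touching the network, here is a minimal offline sketch. It is not the course's reference solution: the names `LinkCollector`, `extract_links`, `follow_links`, `fetch` and the `PAGES` dictionary are made up for illustration, and the standard library's `HTMLParser` stands in for Beautiful Soup's `soup('a')` only so that the example runs without third-party packages.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Like tag.get('href', None): None when the anchor has no href.
            self.links.append(dict(attrs).get("href"))


def extract_links(html):
    """Return the list of hrefs found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


def follow_links(url, count, position, fetch):
    """Follow the link at 0-based `position`, `count` times.

    `fetch` is a hypothetical callable that maps a URL to an HTML string.
    Returns the list of URLs that were selected along the way.
    """
    visited = []
    for _ in range(count):
        links = extract_links(fetch(url))
        url = links[position]  # the fix: the next iteration fetches this URL
        print("Retrieving:", url)
        visited.append(url)
    return visited


# Hypothetical in-memory "site" used instead of real HTTP requests.
PAGES = {
    "a.html": '<a href="b.html">B</a><a href="c.html">C</a>',
    "b.html": '<a href="a.html">A</a><a href="c.html">C</a>',
    "c.html": '<a href="a.html">A</a><a href="b.html">B</a>',
}

result = follow_links("a.html", 3, 1, PAGES.__getitem__)
```

With real pages, `fetch` could be something like `lambda u: urllib.request.urlopen(u, context=ctx).read().decode()`, assuming the pages decode as UTF-8 and the links are absolute URLs.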
Also: if the page has fewer than pn links (e.g. pn=12 but the page has only 2 links), accessing lst[position] raises an IndexError, because lst has fewer than pn elements.
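One possible way to guard against that is to check the index before using it. This is just a sketch; `pick_link` is a hypothetical helper name, and returning None (rather than raising or printing a message) is an arbitrary choice for illustration.

```python
def pick_link(links, position):
    """Return links[position], or None if the page has too few links.

    `position` is the 0-based index (pn - 1 in the question's code).
    """
    if 0 <= position < len(links):
        return links[position]
    return None


# A page with only 2 links, asked for the 12th one (pn=12, position=11).
short_page = ["http://example.com/1", "http://example.com/2"]
missing = pick_link(short_page, 11)
second = pick_link(short_page, 1)
```

Inside the loop you could then stop with `break` when the result is None instead of letting the program crash.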
Upvotes: 1