Following links using Beautiful Soup?

Question

So I have just started learning Python through the Coursera online course "Python for Everybody", and I have an assignment where I have to follow links using Beautiful Soup. I saw this question asked before, but the solution there didn't work when I tried it. I managed to write something, but it doesn't actually follow the links; it just stays on the same page. If possible, can anyone also point me to materials that give better insight into this assignment? Thanks.

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter URL - ')
cnt = input("Enter count -")
count = int(cnt)
pn = input("Enter position -")
position = int(pn)-1

while count > 0:
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, "html.parser")
    tags = soup('a')
    lst = list()
    for tag in tags:
        lst.append(tag.get('href', None))
    indxpos = lst[position]
    count = count - 1
    print("Retrieving:", indxpos)

jdabtieu · Accepted Answer

You never set url to the new URL, so every iteration fetches the same page.

while count > 0:
    html = urllib.request.urlopen(url, context=ctx).read()  # Gets the page at url
    ...
    for tag in tags:
        lst.append(tag.get('href', None))  # Appends all the links to lst
    indxpos = lst[position]
    count = count - 1
    print("Retrieving:", indxpos)
    # What happens to indxpos? It is printed but never assigned to url

You should probably replace indxpos with url instead.

while count > 0:
    html = urllib.request.urlopen(url, context=ctx).read()  # Gets the page at url
    ...
    for tag in tags:
        lst.append(tag.get('href', None))  # Appends all the links to lst
    url = lst[position]
    count = count - 1
    print("Retrieving:", url)

This way, the next time the loop runs, it will fetch the new URL.
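
For reference, here is a minimal sketch of the whole loop with that fix applied. It is not the course's reference solution. The names follow_links, fetch and fetch_page are hypothetical, the SSL context from the question is left out for brevity, and the urljoin call is an extra assumption on my part so that relative hrefs still resolve to a full URL.

```python
import urllib.request
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def fetch(url):
    """Download a page and return its raw bytes (hypothetical helper)."""
    return urllib.request.urlopen(url).read()


def follow_links(url, count, position, fetch_page=fetch):
    """Follow the link at zero-based `position`, `count` times in a row."""
    for _ in range(count):
        soup = BeautifulSoup(fetch_page(url), "html.parser")
        # Collect the href of every anchor tag, in document order.
        links = [tag.get("href") for tag in soup("a")]
        # Reassign url so the next iteration fetches the page just picked.
        # urljoin (an assumption, not in the original) resolves relative hrefs.
        url = urljoin(url, links[position])
        print("Retrieving:", url)
    return url
```

Passing the download function in as fetch_page is only there so the loop can be tried against canned pages without touching the network; calling follow_links(url, count, position) uses the real fetch.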

Also: if the page has fewer than pn links (e.g. pn=12 but the page has only 2 links), accessing lst[position] raises an IndexError, because lst has fewer than pn elements.
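
One way to guard against that is to check the position before indexing. The helper below is only a sketch; pick_link is a hypothetical name, and returning None for an out-of-range position is my own choice rather than anything the assignment requires.

```python
def pick_link(links, pn):
    """Return the link at one-based position `pn`, or None if out of range."""
    # Checking the lower bound too avoids pn=0 turning into index -1,
    # which would silently select the last link.
    if 1 <= pn <= len(links):
        return links[pn - 1]
    return None
```

Inside the loop you could then stop early when pick_link(lst, pn) returns None instead of letting the program crash.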
