Enird

Reputation: 3

Following links using Beautiful Soup?

I have just started learning Python through the Coursera course "Python for Everybody", and I have an assignment that requires following links using Beautiful Soup. I found an earlier question about this, but its solution did not work when I tried it. The code below runs, but it never follows the links; it fetches the same page on every iteration. If possible, could anyone also point me to materials that give better insight into this assignment? Thanks.

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter URL - ')
cnt = input("Enter count -")
count = int(cnt)
pn = input("Enter position -")
position = int(pn)-1

while count > 0:
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, "html.parser")
    tags = soup('a')
    lst = list()
    for tag in tags:
        lst.append(tag.get('href', None))
    indxpos = lst[position]
    count = count - 1
    print("Retrieving:", indxpos)

Upvotes: 0

Views: 671

Answers (2)

crusher083

Reputation: 169

Your code never does anything with the hyperlinks it collects. It prints the link at the chosen position in lst, but it never fetches that link.

Upvotes: 0

jdabtieu

Reputation: 554

You never set url to the new URL.

while count > 0:
    html = urllib.request.urlopen(url, context=ctx).read()  # Gets the page at url
    ...
    for tag in tags:
        lst.append(tag.get('href', None))  # Appends all the links to lst
    indxpos = lst[position]
    count = count - 1
    print("Retrieving:", indxpos)
    # indxpos is printed but never assigned to url, so the next iteration fetches the same page

Replace indxpos with url, so that the chosen link becomes the next page to fetch:

while count > 0:
    html = urllib.request.urlopen(url, context=ctx).read()  # Gets the page at url
    ...
    for tag in tags:
        lst.append(tag.get('href', None))  # Appends all the links to lst
    url = lst[position]
    count = count - 1
    print("Retrieving:", url)

This way, the next time the loop runs, it will fetch the new URL.
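For reference, here is a minimal sketch of the whole loop with that fix applied. It is not the only way to write it. The function names follow_links and fetch_html and the fetch parameter are made up for illustration (fetch just makes the loop easy to try without a network). The urljoin call is an addition that is not in the question's code: it resolves relative hrefs against the current page, on the assumption that some pages may use them. The custom SSL context from the question is left out here.

```python
import urllib.request
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its raw bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def follow_links(url, count, position, fetch=fetch_html):
    """Follow the link at 1-based `position` on each page, `count` times.

    `fetch` is a hypothetical hook: any callable that takes a URL and
    returns the page's HTML.
    """
    for _ in range(count):
        soup = BeautifulSoup(fetch(url), "html.parser")
        # Same collection step as in the question: one entry per <a> tag
        links = [tag.get("href", None) for tag in soup("a")]
        # The actual fix: reassign url so the next iteration fetches the
        # new page. urljoin (an addition) resolves relative links.
        url = urljoin(url, links[position - 1])
        print("Retrieving:", url)
    return url
```

With the inputs from the question, this would be called as follow_links(url, count, int(pn)). The return value is the last URL reached.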

Also, if the page has fewer than pn links (e.g. pn=12 but the page has only 2 links), accessing lst[position] raises an IndexError, because lst has fewer than pn elements.
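One way to guard against that is to check the position before indexing. This is only a sketch: pick_link is a hypothetical helper, and it takes the 1-based number the user typed (pn), not the already decremented position from the question.

```python
def pick_link(links, position):
    """Return the link at 1-based `position`, or None if it does not exist."""
    # Reject positions below 1 as well: in the question's code, pn = 0
    # would become index -1 and silently pick the last link.
    if position < 1 or position > len(links):
        return None
    return links[position - 1]
```

Inside the loop you could then write url = pick_link(lst, int(pn)) and stop with a message when the result is None, instead of letting the script crash.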

Upvotes: 1
