Reputation:
I am trying to read the HTML from the data file below, extract the href= values from the anchor tags, find the tag at a particular position relative to the first name in the list, follow that link, repeat the process a number of times, and report the last name I find:
(URL: http://py4e-data.dr-chuck.net/known_by_Emir.html)
Find the link at position 18 (the first name is 1). Follow that link. Repeat this process 7 times. The answer is the last name that you retrieve. Hint: The first character of the name of the last page that you will load is: M
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))
Upvotes: 0
Views: 510
Reputation: 5531
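You can put the whole thing within a loop, passing the link extracted on each pass to the next request. Position 18 corresponds to index 17 because Python lists are zero-based: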
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = 'http://py4e-data.dr-chuck.net/known_by_Emir.html'
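for x in range(7):
    if x != 0:
        # Later passes: open the link extracted on the previous pass
        html = urllib.request.urlopen(tag, context=ctx).read()
    else:
        # First pass: open the starting URL
        html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
    # Position 18 (counting from 1) is index 17
    tag = soup.find_all('a')[17]['href']
# After 7 passes, tag holds the last link retrieved
print(tag)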
Output:
http://py4e-data.dr-chuck.net/known_by_Maya.html
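If you want the position and repeat count as parameters, the same loop can be wrapped in a function. This is only a rough sketch, not the one true way to do it: `follow_links`, `HrefCollector`, `fetch` and the `fetch_page` argument are names I made up for illustration. I have also swapped BeautifulSoup for the standard library's `html.parser` so the snippet runs without third-party packages, and dropped the SSL context on the assumption that the URLs are plain `http`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.hrefs.append(href)


def fetch(url):
    """Download a page and return its text (makes a real network request)."""
    with urlopen(url) as response:
        return response.read().decode('utf-8', errors='replace')


def follow_links(start_url, position, repeats, fetch_page=fetch):
    """Pick the link at 1-based `position`, `repeats` times; return the last URL found."""
    url = start_url
    for _ in range(repeats):
        parser = HrefCollector()
        parser.feed(fetch_page(url))
        # position counts from 1, list indexes count from 0
        url = urljoin(url, parser.hrefs[position - 1])
    return url
```

Calling `follow_links('http://py4e-data.dr-chuck.net/known_by_Emir.html', 18, 7)` performs the same seven passes as the loop above. The `fetch_page` argument is there so you can try the logic against canned HTML strings instead of the live site.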
Upvotes: 1