Reputation: 169
I have to write a program that reads the HTML from this link (http://python-data.dr-chuck.net/known_by_Maira.html), extracts the href= values from the anchor tags, scans for the tag at a particular position relative to the first name in the list, follows that link, repeats the process a number of times, and reports the last name found.
I am supposed to find the link at position 18 (the first name is 1), follow that link and repeat this process 7 times. The answer is the last name that I retrieve.
Here is the code I found and it works just fine.
import urllib
from BeautifulSoup import *
url = raw_input("Enter URL: ")
count = int(raw_input("Enter count: "))
position = int(raw_input("Enter position: "))
names = []
while count > 0:
    print "retrieving: {0}".format(url)
    page = urllib.urlopen(url)
    soup = BeautifulSoup(page)
    tag = soup('a')
    name = tag[position-1].string
    names.append(name)
    url = tag[position-1]['href']
    count -= 1
print names[-1]
I would really appreciate it if someone could explain, as you would to a 10-year-old, what's going on inside the while loop. I am new to Python and would appreciate the guidance.
Upvotes: 2
Views: 5824
Reputation: 73
Solution with explanations.
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
url = input('Enter - ')
count = int(input('Enter count: '))
position = int(input ('Enter position: '))
names = []
while count > 0:
    print('Retrieving: {}'.format(url))
    html = urllib.request.urlopen(url)  # open the url using urllib
    soup = BeautifulSoup(html, 'html.parser')  # parse the html into a searchable tree
    # Retrieve all of the anchor tags
    tags = soup('a')
    # This gets the <a> tag at position-1 and then gets its text value
    name = tags[position-1].string
    names.append(name)  # add the name to our list
    url = tags[position-1]['href']  # retrieve the url for the next iteration
    count -= 1
print(names)
print('Answer: ', names[-1])  # the last name retrieved
Hope it helps.
Upvotes: 0
Reputation: 9
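import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
# ctx was used below without being defined; a default SSL context is assumed here
ctx = ssl.create_default_context()
total=0
url = input('Enter - ')
c=input('enter count-')
count=int(c)
p=input('enter position-')
pos=int(p)
while total<=count:  # runs count+1 times: the start page plus count followed links
    html = urllib.request.urlopen(url, context=ctx).read()
    print("Retrieving",url)
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')
    counter=0
    for tag in tags:
        counter=counter+1
        if(counter<=pos):
            # keep overwriting url until the link at position pos is reached
            x=tag.get('href',None)
            url=x
        else:
            break
    total=total+1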
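The inner counter loop above ends up with the same link that indexing with `pos-1` would give. A minimal sketch of that equivalence, using plain strings in place of real tags (the function names and the sample list are made up for illustration):

```python
def href_by_counter(hrefs, pos):
    """Mimic the counter loop: keep overwriting url until pos links were seen."""
    url = None
    counter = 0
    for href in hrefs:
        counter += 1
        if counter <= pos:
            url = href
        else:
            break
    return url


def href_by_index(hrefs, pos):
    """Pick the link directly; pos is 1-based, list indexes are 0-based."""
    return hrefs[pos - 1]


# Hypothetical hrefs standing in for the <a> tags of one page.
links = ["a.html", "b.html", "c.html", "d.html"]

print(href_by_counter(links, 3))  # c.html
print(href_by_index(links, 3))    # c.html
```

One difference worth knowing: if `pos` is larger than the number of links, the counter version quietly returns the last link, while the indexing version raises `IndexError`.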
Upvotes: 0
Reputation: 77357
while count > 0:                          # because of `count -= 1` below,
                                          # will run loop count times
    print "retrieving: {0}".format(url)   # just prints out the next web page
                                          # you are going to get
    page = urllib.urlopen(url)            # urls reference web pages (well,
                                          # many types of web content but
                                          # we'll stick with web pages)
    soup = BeautifulSoup(page)            # web pages are frequently written
                                          # in html which can be messy. this
                                          # package "unmessifies" it
    tag = soup('a')                       # in html you can highlight text and
                                          # reference other web pages with <a>
                                          # tags. this gets all of the <a> tags
                                          # in a list
    name = tag[position-1].string         # This gets the <a> tag at position-1
                                          # and then gets its text value
    names.append(name)                    # this puts that value in your own
                                          # list.
    url = tag[position-1]['href']         # html tags can have attributes. On
                                          # an <a> tag, the href="something"
                                          # attribute references another web
                                          # page. You store it in `url` so that
                                          # it's the page you grab on the next
                                          # iteration of the loop.
    count -= 1
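If you want to see what "a list of <a> tags" looks like without going to the network, here is a small sketch. It uses the standard library's `html.parser` rather than BeautifulSoup, so it is a stand-in for `soup('a')`, not the same API. The sample HTML, the example.com URLs, and the `AnchorCollector` class are all made up for illustration.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect a (text, href) pair for every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._in_anchor = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._href = dict(attrs).get("href")  # None if there is no href
            self._text = []

    def handle_data(self, data):
        if self._in_anchor:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            self.anchors.append(("".join(self._text).strip(), self._href))
            self._in_anchor = False


# Hypothetical page content, shaped like the assignment's list of names.
SAMPLE = """
<ul>
  <li><a href="http://example.com/known_by_Ann.html">Ann</a></li>
  <li><a href="http://example.com/known_by_Bob.html">Bob</a></li>
  <li><a href="http://example.com/known_by_Cy.html">Cy</a></li>
</ul>
"""

collector = AnchorCollector()
collector.feed(SAMPLE)

position = 2                                 # the first name is 1
name, href = collector.anchors[position - 1]  # so subtract 1 for the list index
print(name, href)  # Bob http://example.com/known_by_Bob.html
```

The `position - 1` is the only subtle part: people count the names from 1, Python counts list items from 0.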
Upvotes: 2
Reputation: 12189
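You enter the starting url, the number of links to follow (count), and the position of the link to pick on each page. Then, on each pass through the loop, the code:
0) prints the current url
1) opens the url
2) reads and parses the source (see the BeautifulSoup docs)
3) gets every a tag on the page
4) takes the <a ...></a> tag at the given position and reads its text, I think
5) adds that text to the list names
6) pulls the href from the same <a ...></a> tag and uses it as the next url
After the loop finishes:
7) prints the last item of the list names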
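The steps above can be sketched as one small function. To keep it runnable offline, the download-and-parse part (steps 1 to 3) is passed in as a function, and the "pages" are a made-up dictionary. `follow_links`, `FAKE_PAGES`, and the names in it are assumptions for illustration, not part of the original code.

```python
def follow_links(get_anchors, url, count, position):
    """Follow the link at `position` (1-based) `count` times.

    `get_anchors(url)` must return a list of (name, href) pairs for that page.
    Returns the list of names seen along the way.
    """
    names = []
    for _ in range(count):
        print("retrieving:", url)           # step 0
        anchors = get_anchors(url)          # steps 1-3: open, parse, list <a> tags
        name, href = anchors[position - 1]  # step 4: tag at the given position
        names.append(name)                  # step 5
        url = href                          # step 6: next page to visit
    return names


# Hypothetical site: each "page" is just its list of (name, href) links.
FAKE_PAGES = {
    "start": [("Ann", "ann"), ("Bob", "bob")],
    "bob": [("Eve", "eve"), ("Ann", "ann")],
    "ann": [("Cy", "cy"), ("Dee", "dee")],
}

names = follow_links(FAKE_PAGES.__getitem__, "start", count=3, position=2)
print(names[-1])  # step 7, prints: Dee
```

With real pages, `get_anchors` would be the part that calls `urlopen` and BeautifulSoup; the loop itself stays the same.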
Upvotes: 0