I found the following web scraping code in Web Scraping with Python by Ryan Mitchell:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
pages = set()
def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org"+pageUrl)
    bsObj = BeautifulSoup(html)
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # find new page
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)
getLinks("")
My understanding is that findAll() already filters the results: every tag object it returns must have an href attribute matching the given pattern by the time the for loop runs. Why do we still need to check whether the object has the href attribute afterward?
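For what it's worth, here is a minimal check I would try (the HTML snippet is just made up for illustration, not from the book), and it seems to show that findAll() with an href filter never returns a tag lacking a matching href:

from bs4 import BeautifulSoup
import re

# Hypothetical HTML: one matching link, one tag with no href, one non-matching href
html = '<a href="/wiki/Python">ok</a><a>no href</a><a href="/other">other</a>'
bsObj = BeautifulSoup(html, "html.parser")

for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
    # Every tag returned here already has a matching href attribute
    print('href' in link.attrs, link.attrs['href'])
# Prints: True /wiki/Python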
In my opinion, this line of code could simply be deleted: if 'href' in link.attrs:
Am I thinking about this correctly?
Upvotes: 2
Views: 68