Reputation: 99
I am trying to recursively crawl a Wikipedia URL for all English article links. I want to perform a depth-first traversal down to depth n, but for some reason my code does not recurse on every pass. Any idea why?
def crawler(url, depth):
    if depth == 0:
        return None
    links = bs.find("div", {"id": "bodyContent"}).findAll("a", href=re.compile("(/wiki/)+([A-Za-z0-9_:()])+"))
    print("Level ", depth, " ", url)
    for link in links:
        if ':' not in link['href']:
            crawler("https://en.wikipedia.org" + link['href'], depth - 1)
This is the call to the crawler:
url = "https://en.wikipedia.org/wiki/Harry_Potter"
html = urlopen(url)
bs = BeautifulSoup(html, "html.parser")
crawler(url,3)
Upvotes: 2
Views: 1977
Reputation: 7238
You need to fetch the page source (send a request to the page) for every URL you visit. That part is missing from your crawler() function. Because those lines sit outside the function, they run only once, so every recursive call searches the same parsed page instead of the page at the new URL.
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

def crawler(url, depth):
    if depth == 0:
        return None
    # Fetch and parse the page for *this* URL on every call.
    html = urlopen(url)                        # You were missing
    soup = BeautifulSoup(html, 'html.parser')  # these lines.
    links = soup.find("div", {"id": "bodyContent"}).findAll("a", href=re.compile("(/wiki/)+([A-Za-z0-9_:()])+"))
    print("Level ", depth, url)
    for link in links:
        # Skip namespaced pages such as File: or Category:.
        if ':' not in link['href']:
            crawler("https://en.wikipedia.org" + link['href'], depth - 1)

url = "https://en.wikipedia.org/wiki/Big_data"
crawler(url, 3)
Partial Output:
Level 3 https://en.wikipedia.org/wiki/Big_data
Level 2 https://en.wikipedia.org/wiki/Big_Data_(band)
Level 1 https://en.wikipedia.org/wiki/Brooklyn
Level 1 https://en.wikipedia.org/wiki/Electropop
Level 1 https://en.wikipedia.org/wiki/Alternative_dance
Level 1 https://en.wikipedia.org/wiki/Indietronica
Level 1 https://en.wikipedia.org/wiki/Indie_rock
Level 1 https://en.wikipedia.org/wiki/Warner_Bros._Records
Level 1 https://en.wikipedia.org/wiki/Joywave
Level 1 https://en.wikipedia.org/wiki/Electronic_music
Level 1 https://en.wikipedia.org/wiki/Dangerous_(Big_Data_song)
Level 1 https://en.wikipedia.org/wiki/Joywave
Level 1 https://en.wikipedia.org/wiki/Billboard_(magazine)
Level 1 https://en.wikipedia.org/wiki/Alternative_Songs
Level 1 https://en.wikipedia.org/wiki/2.0_(Big_Data_album)
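Note that the output above lists Joywave twice: the same link can appear several times on a page, and the crawler requests it again each time. If that matters to you, one possible variation is to remember the URLs already fetched. The following is only a sketch under some assumptions: the names `crawl`, `fetch_links`, `get_links` and `visited` are mine and not part of the original code, and the link fetcher is passed in as a parameter so the recursion can be tried without network access.

```python
import re
from urllib.request import urlopen

BASE = "https://en.wikipedia.org"
# Same pattern as in the code above.
ARTICLE_HREF = re.compile("(/wiki/)+([A-Za-z0-9_:()])+")


def fetch_links(url):
    """Download one page and return absolute URLs of its article links."""
    # Imported here so the crawl logic below can be exercised offline.
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(urlopen(url), "html.parser")
    body = soup.find("div", {"id": "bodyContent"})
    if body is None:
        return []
    return [BASE + a["href"]
            for a in body.find_all("a", href=ARTICLE_HREF)
            if ":" not in a["href"]]


def crawl(url, depth, get_links=fetch_links, visited=None):
    """Depth-first crawl that requests each URL at most once."""
    if visited is None:
        visited = set()
    # Stop at the depth limit or when this URL was already fetched.
    if depth == 0 or url in visited:
        return visited
    visited.add(url)
    print("Level", depth, url)
    for link in get_links(url):
        crawl(link, depth - 1, get_links, visited)
    return visited
```

Calling `crawl("https://en.wikipedia.org/wiki/Big_data", 3)` would then behave like the function above, minus the repeated requests. One trade-off to be aware of: because the traversal is depth first, a page first reached near the depth limit is marked as visited and will not be expanded again if it is later reached with more depth remaining.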
Upvotes: 1