I am attempting to make a little web crawler in Python. What is tripping me up right now is the recursive part and the depth of this problem. Given a URL and a maxDepth of how many sites from there I want to link to, I add the URL to the set of searched sites and download all the text and links from that site. For each link the page contains, I want to crawl that link and collect its words and links in turn. The problem is that when I go to recursively call the next URL, the depth is already at maxDepth, so it stops after visiting only one more page. Hopefully I explained it decently; basically the question I am asking is: how do I do all the recursive calls and then set self._depth += 1?
def crawl(self, url, maxDepth):
    self._listOfCrawled.add(url)
    text = crawler_util.textFromURL(url).split()
    for each in text:
        self._index[each] = url
    links = crawler_util.linksFromURL(url)
    if self._depth < maxDepth:
        self._depth = self._depth + 1
        for i in links:
            if i not in self._listOfCrawled:
                self.crawl(i, maxDepth)
The problem with your code is that you increase self._depth each time you call the function, and since it is an instance variable, it stays increased in the following calls. Say maxDepth is 3 and you have a URL A that links to pages B and C, B links to D, and C has a link to E. Your call hierarchy then looks like this (assuming self._depth is 0 at the beginning):
crawl(self, A, 3)          # self._depth set to 1, following links to B and C
    crawl(self, B, 3)      # self._depth set to 2, following link to D
        crawl(self, D, 3)  # self._depth set to 3, no links to follow
    crawl(self, C, 3)      # self._depth >= maxDepth, skipping link to E
In other words, instead of the depth of the current call, you track the accumulated number of calls to crawl.
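To see this concretely, here is a minimal sketch of the failure mode, using a hypothetical in-memory link graph in place of real HTTP requests; the BuggyCrawler class and the LINKS dict are illustrative stand-ins, not the asker's actual module:

# Toy link graph matching the A/B/C/D/E example above.
LINKS = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": [], "E": []}

class BuggyCrawler:
    def __init__(self):
        self._depth = 0              # shared across ALL recursive calls
        self._listOfCrawled = set()

    def crawl(self, url, maxDepth):
        self._listOfCrawled.add(url)
        if self._depth < maxDepth:
            self._depth += 1         # never decreased, so it only grows
            for link in LINKS[url]:
                if link not in self._listOfCrawled:
                    self.crawl(link, maxDepth)

c = BuggyCrawler()
c.crawl("A", 3)
print(sorted(c._listOfCrawled))      # ['A', 'B', 'C', 'D'] -- E is never visited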
Instead, try something like this:
def crawl(self, url, depthToGo):
    # call this method with depthToGo set to maxDepth
    self._listOfCrawled.add(url)
    text = crawler_util.textFromURL(url).split()
    for each in text:
        # if the word is not in the index yet, create a new set for it,
        # then add the current URL to that set
        if each not in self._index:
            self._index[each] = set()
        self._index[each].add(url)
    links = crawler_util.linksFromURL(url)
    # check if we can go deeper
    if depthToGo > 0:
        for i in links:
            if i not in self._listOfCrawled:
                # decrease depthToGo for the next level of recursion
                self.crawl(i, depthToGo - 1)
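For completeness, here is a runnable sketch of how the fixed method would be used. The crawler_util module is stubbed with canned pages (the real one presumably fetches over HTTP), the Crawler class wrapping the method is likewise an assumption for illustration, and the crawl body is condensed with dict.setdefault but behaves the same as the version above:

# Hypothetical stand-ins: a canned page table and a fake crawler_util,
# so the example runs without any network access.
PAGES = {
    "A": ("alpha beta",  ["B", "C"]),
    "B": ("beta gamma",  ["D"]),
    "C": ("gamma delta", ["E"]),
    "D": ("delta",       []),
    "E": ("epsilon",     []),
}

class crawler_util:                      # stand-in for the real module
    @staticmethod
    def textFromURL(url):
        return PAGES[url][0]

    @staticmethod
    def linksFromURL(url):
        return PAGES[url][1]

class Crawler:
    def __init__(self):
        self._listOfCrawled = set()
        self._index = {}

    def crawl(self, url, depthToGo):
        self._listOfCrawled.add(url)
        for each in crawler_util.textFromURL(url).split():
            self._index.setdefault(each, set()).add(url)
        if depthToGo > 0:
            for i in crawler_util.linksFromURL(url):
                if i not in self._listOfCrawled:
                    self.crawl(i, depthToGo - 1)

crawler = Crawler()
crawler.crawl("A", 3)                    # pass maxDepth as depthToGo
print(sorted(crawler._listOfCrawled))    # ['A', 'B', 'C', 'D', 'E']
print(sorted(crawler._index["gamma"]))   # ['B', 'C']

Note that depthToGo counts down toward 0 as the recursion goes deeper, so each call carries its own remaining budget rather than mutating shared state, which is exactly what fixes the original bug.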