Reputation: 3501
I am writing a web crawler, but I have a problem with the function that recursively follows links.
Let's suppose I have a page: http://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind.
I collect all the links on it, then open each link recursively, collect its links again, and so on.
The problem is that some links, although they have different URLs, lead to the same page, for example:
http://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind#mw-navigation
gives the same page as the previous link.
So I end up in an infinite loop.
Is there any way to check whether two links lead to the same page without comparing the full content of those pages?
Upvotes: 1
Views: 57
Reputation: 474041
There is no need to make extra requests to the same page.
You can use urlparse()
and check whether the .path
part of the base URL and of the link you crawl is the same:
from urllib2 import urlopen
from urlparse import urljoin, urlparse
from bs4 import BeautifulSoup

url = "http://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind"
base_url = urlparse(url)

soup = BeautifulSoup(urlopen(url))
for link in soup.find_all('a'):
    if 'href' in link.attrs:
        # resolve the href against the page url and compare only the paths
        link_url = urljoin(url, link['href'])
        print link_url, urlparse(link_url).path == base_url.path
Prints:
http://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind#mw-navigation True
http://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind#p-search True
http://en.wikipedia.org/wiki/File:Set_partitions_4;_Hasse;_circles.svg False
...
http://en.wikipedia.org/wiki/Equivalence_relation False
...
http://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind True
...
https://www.mediawiki.org/ False
This particular example uses BeautifulSoup
to parse the Wikipedia page and get all the links, but the actual HTML parser is not really important here. What matters is that you extract the links and get the path to check.
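If you also want the crawler to skip such duplicates before requesting them, one option (a minimal sketch, not part of the answer above; the seen set and the not_seen_yet() helper are just illustrative names) is to normalize every URL by stripping its fragment with urldefrag() and remember the normalized form:
from urlparse import urldefrag

seen = set()

def not_seen_yet(url):
    # urldefrag() drops the '#fragment' part, so the plain article URL and
    # '...#mw-navigation' normalize to the same string
    clean, fragment = urldefrag(url)
    if clean in seen:
        return False
    seen.add(clean)
    return True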
Upvotes: 1
Reputation: 4653
You can store a hash of the content of each page you have already seen and check whether that hash has been recorded before continuing.
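A minimal sketch of that idea, assuming the page body is fetched with urlopen() as in the other answer (the seen_hashes set and the is_new_page() helper are just illustrative names):
import hashlib
from urllib2 import urlopen

seen_hashes = set()

def is_new_page(url):
    # hash the downloaded body and skip the page if the same hash was seen before
    content = urlopen(url).read()
    digest = hashlib.sha1(content).hexdigest()
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True
Note that this still costs one request per URL, so it complements rather than replaces a URL-based check.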
Upvotes: 1