Ziva

Reputation: 3501

Predict if sites return the same content

I am writing a web crawler, but I have a problem with the function that recursively follows links. Suppose I start from a page such as http://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind. I collect all of its links, then open each link recursively, downloading all of its links in turn, and so on. The problem is that some links, although they have different URLs, lead to the same page. For example, http://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind#mw-navigation returns the same page as the previous link, so I end up in an infinite loop.

Is there any way to check whether two links lead to the same page without comparing the entire content of the pages?
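
For illustration, a minimal sketch showing that the two URLs above differ only in their fragment (the variable names are just for demonstration):

from urlparse import urlparse

a = urlparse("http://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind")
b = urlparse("http://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind#mw-navigation")

# Scheme, host and path are identical; only the fragment differs.
print a.path == b.path            # True
print (a.fragment, b.fragment)    # ('', 'mw-navigation')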

Upvotes: 1

Views: 57

Answers (2)

alecxe

Reputation: 474041

No need to make extra requests to the same page.

You can use urlparse() and check whether the .path part of the base URL and that of the link you crawl are the same:

from urllib2 import urlopen
from urlparse import urljoin, urlparse
from bs4 import BeautifulSoup

url = "http://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind"
base_url = urlparse(url)

soup = BeautifulSoup(urlopen(url))
for link in soup.find_all('a'):
    if 'href' in link.attrs:
        # Resolve the href against the page URL without overwriting it,
        # so later relative links are still joined against the right base.
        full_url = urljoin(url, link['href'])
        # Links that differ only in the #fragment share the same path.
        print full_url, urlparse(full_url).path == base_url.path

Prints:

http://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind#mw-navigation True
http://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind#p-search True
http://en.wikipedia.org/wiki/File:Set_partitions_4;_Hasse;_circles.svg False
...
http://en.wikipedia.org/wiki/Equivalence_relation False
...
http://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind True
...
https://www.mediawiki.org/ False

This particular example uses BeautifulSoup to parse the Wikipedia page and collect all of the links, but the actual HTML parser is not really important here. What matters is that you parse the links and extract the path to compare.
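
If you want to fold this into the recursive crawler from the question, one rough sketch (crawl, visited_paths and the depth limit are illustrative names and choices, not part of the original answer) is to key the set of visited pages on the path:

from urllib2 import urlopen
from urlparse import urljoin, urlparse
from bs4 import BeautifulSoup

def crawl(url, visited_paths, depth=2):
    # Identify a page by its path, so fragment-only variants count as already seen.
    path = urlparse(url).path
    if depth == 0 or path in visited_paths:
        return
    visited_paths.add(path)

    soup = BeautifulSoup(urlopen(url))
    for link in soup.find_all('a'):
        if 'href' in link.attrs:
            next_url = urljoin(url, link['href'])
            # Only follow http(s) links; skip mailto:, javascript:, etc.
            if urlparse(next_url).scheme in ('http', 'https'):
                crawl(next_url, visited_paths, depth - 1)

crawl("http://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind", set())

Note that keying on the path alone conflates pages that share a path on different hosts; if the crawl can leave one site, a (netloc, path) pair is a safer key.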

Upvotes: 1

Diego Allen

Reputation: 4653

You can store a hash of the content of each page you have already seen and check whether a newly downloaded page's hash is in that set before continuing.
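
A rough sketch of that idea (seen_hashes and is_new_page are illustrative names; note that the page still has to be downloaded before it can be hashed, but only the digest is stored and compared):

import hashlib
from urllib2 import urlopen

seen_hashes = set()

def is_new_page(url):
    # Hash the response body; URLs that return identical content share a digest,
    # so fragment variants of the same page are recognised as already seen.
    digest = hashlib.sha1(urlopen(url).read()).hexdigest()
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True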

Upvotes: 1
