iCrawly

Reputation: 19

Crawler Coding: determine if pages have been crawled?

I am working on a crawler in PHP that starts from m URLs; at each of these it finds a set of n links to internal pages, which are then crawled for data. Links may be added to or removed from the n set over time. I need to keep track of the links/pages so that I know which have been crawled, which have been removed, and which are new.

How should I go about keeping track of which m and n pages have been crawled, so that the next crawl fetches new URLs, re-checks still-existing URLs, and ignores obsolete ones?

Upvotes: 1

Views: 508

Answers (1)

Naveed

Reputation: 42143

If you want to store this data for the long term, then use a database. You can store the crawled m URLs and their n URLs in the database along with their statuses. When you are going to crawl again, first check the database for already-crawled URLs.

For example:

Store your mURLs in mtable, something like this:

 id |        mURL           | status       |    crawlingDate
------------------------------------------------------------------
 1  | example.com/one.php   | crawled      |   01-01-2010 12:30:00
 2  | example.com/two.php   | crawled      |   01-01-2010 12:35:10
 3  | example.com/three.php | not-crawled  |   01-01-2010 12:40:33

Now fetch each mURL from mtable, get all of its n URLs, and store them in ntable, something like this:

 id |        nURL             | mURL_id |  status      | crawlingDate
----------------------------------------------------------------------------
 1  | www.one.com/page1.php   |    1    |  crawled     | 01-01-2010 12:31:00
 2  | www.one.com/page2.php   |    1    |  crawled     | 01-01-2010 12:32:00
 3  | www.two.com/page1.php   |    2    |  crawled     | 01-01-2010 12:36:00
 4  | www.two.com/page2.php   |    2    |  crawled     | 01-01-2010 12:37:00
 5  | www.three.com/page1.php |    3    |  not-crawled | 01-01-2010 12:41:00
 6  | www.three.com/page2.php |    3    |  not-crawled | 01-01-2010 12:42:00
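For example, the two tables could be created like this (a minimal sketch, assuming SQLite through PDO; the table and column names simply mirror the example rows above):

    <?php
    // Minimal schema sketch: SQLite via PDO. Column names mirror the
    // example tables above; adjust the types for your own database.
    $db = new PDO('sqlite:crawler.db');
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    $db->exec("CREATE TABLE IF NOT EXISTS mtable (
        id           INTEGER PRIMARY KEY AUTOINCREMENT,
        mURL         TEXT UNIQUE NOT NULL,
        status       TEXT NOT NULL DEFAULT 'not-crawled',
        crawlingDate TEXT
    )");

    $db->exec("CREATE TABLE IF NOT EXISTS ntable (
        id           INTEGER PRIMARY KEY AUTOINCREMENT,
        nURL         TEXT NOT NULL,
        mURL_id      INTEGER NOT NULL REFERENCES mtable(id),
        status       TEXT NOT NULL DEFAULT 'not-crawled',
        crawlingDate TEXT,
        UNIQUE (nURL, mURL_id)
    )");

The UNIQUE constraint on (nURL, mURL_id) is what lets the re-crawl pass below insert links blindly and have already-known ones ignored.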

When you crawl the next time, fetch the records from mtable one by one and get all nURLs for each mURL. Store each nURL in ntable if it does not already exist. Any nURL already in ntable that no longer turns up under its mURL can be marked with a third status value, say obsolete, so that future crawls skip it. Then crawl every nURL whose status is not-crawled to get its data, and set its status to crawled when done. When all nURLs for one mURL are done, you can set the status of that mURL to crawled in mtable.
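As a rough sketch of that pass (continuing from the schema above, and assuming hypothetical fetchLinks() and crawlPage() helpers that stand in for your own link-extraction and scraping code):

    <?php
    // Re-crawl pass: register newly found nURLs, crawl pending ones,
    // then mark the parent mURL as crawled.
    $insert = $db->prepare(
        "INSERT OR IGNORE INTO ntable (nURL, mURL_id) VALUES (?, ?)");
    $pending = $db->prepare(
        "SELECT id, nURL FROM ntable
         WHERE mURL_id = ? AND status = 'not-crawled'");
    $markN = $db->prepare(
        "UPDATE ntable SET status = 'crawled',
         crawlingDate = datetime('now') WHERE id = ?");
    $markM = $db->prepare(
        "UPDATE mtable SET status = 'crawled',
         crawlingDate = datetime('now') WHERE id = ?");

    foreach ($db->query("SELECT id, mURL FROM mtable") as $m) {
        // Insert every link found now; rows that already exist are
        // ignored thanks to the UNIQUE (nURL, mURL_id) constraint.
        foreach (fetchLinks($m['mURL']) as $nURL) {
            $insert->execute([$nURL, $m['id']]);
        }

        // Crawl only the rows that are still pending.
        $pending->execute([$m['id']]);
        foreach ($pending->fetchAll(PDO::FETCH_ASSOC) as $n) {
            crawlPage($n['nURL']);
            $markN->execute([$n['id']]);
        }

        // All nURLs for this mURL are done, so mark the mURL itself.
        $markM->execute([$m['id']]);
    }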

If you don't want to use a database and only want to run the crawler a single time, you can apply the same logic with in-memory arrays.
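As a rough sketch of that in-memory variant (assuming $mURLs holds your list of m URLs, and the same hypothetical fetchLinks() and crawlPage() helpers as above):

    <?php
    // Single-run variant: track crawled pages in an array keyed by
    // URL instead of a database table.
    $seen = [];

    foreach ($mURLs as $mURL) {
        foreach (fetchLinks($mURL) as $nURL) {
            if (!isset($seen[$nURL])) {   // new, not yet crawled
                crawlPage($nURL);
                $seen[$nURL] = true;
            }
        }
    }

The obvious drawback is that $seen vanishes when the script ends, so nothing carries over to the next crawl.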

Hopefully this helps to give you a direction.

Upvotes: 1
