rockstardev

Reputation: 13527

What dangers can a web crawler encounter?

I just finished writing a crawler and have been trying to think of reasons why crawling a certain website could be bad. I know the risk for modern browsers is primarily due to JavaScript. So my question is really: can a web crawler (written in PHP or Java) scrape a site that could somehow cause damage to the crawler?

Upvotes: 1

Views: 2166

Answers (1)

Nathan

Reputation: 1482

This really depends on what your web crawler does. If your crawler is just grabbing text from the HTML, then for the most part you're fine. Of course, this assumes you're sanitizing the data before storing or displaying it. If that's all you're doing, then the only real pain I can think of is someone misdirecting your crawler while it's following links. Depending on the user agent you provide, a site can essentially target your crawler and redirect it anywhere they'd like. You can write code to try to prevent this, but it's obviously difficult to avoid completely.
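One simple defense is to cap how many redirect hops you'll follow before giving up. Here's a minimal sketch in Java; the `redirects` map is a stand-in for real HTTP 3xx `Location` headers, and the names (`RedirectGuard`, `MAX_HOPS`) are my own invention, not anything from a library:

```java
import java.util.HashMap;
import java.util.Map;

class RedirectGuard {
    static final int MAX_HOPS = 5; // arbitrary cap; tune for your crawler

    // Follows a chain of redirects, giving up after MAX_HOPS so a
    // malicious (or broken) site can't bounce the crawler around forever.
    // The `redirects` map stands in for HTTP 3xx Location headers.
    static String resolve(String url, Map<String, String> redirects) {
        String current = url;
        for (int hop = 0; hop < MAX_HOPS; hop++) {
            String next = redirects.get(current);
            if (next == null) {
                return current; // no more redirects: this is the final URL
            }
            current = next;
        }
        return null; // too many hops: treat it as a trap and move on
    }

    public static void main(String[] args) {
        Map<String, String> redirects = new HashMap<>();
        redirects.put("http://example.com/a", "http://example.com/b");
        redirects.put("http://example.com/b", "http://example.com/c");
        redirects.put("http://example.com/loop", "http://example.com/loop"); // self-redirect

        System.out.println(resolve("http://example.com/a", redirects));    // http://example.com/c
        System.out.println(resolve("http://example.com/loop", redirects)); // null
    }
}
```

In a real crawler you'd apply the same cap while reading `Location` headers (e.g. with `HttpURLConnection` and `setInstanceFollowRedirects(false)` so you control each hop yourself).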

However, there are a few gotchas a web crawler can fall for. If it's not smart about what it's doing, it can fall into a spider trap. This basically presents an endless chain of generated pages for your crawler to hit, and is essentially devised to stop web crawlers from crawling the site. Sometimes it happens unintentionally, which is why most web crawlers have a max crawl depth setting. (Chris Jester-Young touched on this in the comments, and made a couple of good points about following links that a user can't see, e.g. a link styled with `display: none`.)
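The usual escape hatch is exactly that max-depth setting, plus a "seen" set so revisited URLs don't loop forever. A minimal sketch of the idea, where the `links` map stands in for fetching a page and extracting its `<a>` hrefs (all names here are hypothetical):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class DepthLimitedCrawl {
    static final int MAX_DEPTH = 3; // the "max crawl depth" setting

    // BFS over a link graph, stopping at MAX_DEPTH and skipping URLs
    // we've already seen, so a spider trap (an endless chain or loop of
    // generated pages) can't keep the crawler busy forever.
    static List<String> crawl(String seed, Map<String, List<String>> links) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String[]> queue = new ArrayDeque<>();
        queue.add(new String[]{seed, "0"});
        seen.add(seed);
        while (!queue.isEmpty()) {
            String[] item = queue.poll();
            String url = item[0];
            int depth = Integer.parseInt(item[1]);
            visited.add(url);
            if (depth >= MAX_DEPTH) continue; // don't follow links any deeper
            for (String next : links.getOrDefault(url, List.of())) {
                if (seen.add(next)) { // add() is false if we've seen it already
                    queue.add(new String[]{next, String.valueOf(depth + 1)});
                }
            }
        }
        return visited;
    }
}
```

With `MAX_DEPTH = 3`, a trap chain `/p0 → /p1 → /p2 → /p3 → /p4 → …` gets cut off after `/p3`, and a page linking to itself is visited exactly once.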

The other thing, obviously, is to be polite. A web crawler eats into a website's bandwidth and resources.

  • Be nice to the website's resources; throttle the crawler when hitting a site multiple times.
    • Some websites will block your crawler if it crawls at too high a rate.
  • Follow robots.txt and the robots meta tags so that you're only crawling locations the webmaster wants crawled.
  • If the website has policies against web crawling, then don't crawl it.
    • These can usually be found in robots.txt or in the site's User Agreement.
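Both politeness rules above fit in a few lines. A rough sketch, assuming you've already fetched and split out the `Disallow:` prefixes for your user agent (real robots.txt matching has more rules than this, and every name here is hypothetical):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PoliteCrawler {
    static final long MIN_DELAY_MS = 1000; // at most one request per second per host

    private final Map<String, Long> lastHit = new HashMap<>();
    private final List<String> disallowed; // Disallow: prefixes for our user agent

    PoliteCrawler(List<String> disallowedPrefixes) {
        this.disallowed = disallowedPrefixes;
    }

    // Very small stand-in for a real robots.txt parser: a path is
    // fetchable unless it starts with one of the Disallow: prefixes.
    boolean allowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }

    // Sleeps just long enough to keep at least MIN_DELAY_MS between
    // requests to the same host, so we don't hammer anyone's server.
    void throttle(String host) throws InterruptedException {
        long now = System.currentTimeMillis();
        Long last = lastHit.get(host);
        if (last != null && now - last < MIN_DELAY_MS) {
            Thread.sleep(MIN_DELAY_MS - (now - last));
        }
        lastHit.put(host, System.currentTimeMillis());
    }
}
```

Call `throttle(host)` before every fetch and skip any URL where `allowed(path)` is false; a production crawler would also honor `Crawl-delay` and the `Allow:` directive where present.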

Last but not least, you can face legal penalties in some countries. Since I'm not a lawyer, I'm not even going to attempt to go into this; look up local laws and regulations before letting the crawler loose.

Upvotes: 2
