Reputation: 13527
I just finished writing a crawler and have been trying to think of reasons why crawling a certain website could be bad. I know the main risk for modern browsers comes from JavaScript. So my question really is: can a web crawler (written in PHP or Java) scrape a site that could somehow cause damage to the crawler?
Upvotes: 1
Views: 2166
Reputation: 1482
This really depends on what your web crawler does. If your crawler is just grabbing text from the HTML, then for the most part you're fine. Of course, this assumes you're sanitizing the data before storing/displaying it. If that's what you're doing, then the only real pain I can think of is someone misdirecting your crawler while it's following links. Depending on the user-agent you provide, a site can essentially target your crawler and redirect it anywhere it likes. You can write code to try to prevent this, but it's obviously difficult to avoid completely. One practical mitigation is to handle redirects yourself and cap how many you'll follow; a rough sketch is below.
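Here's a minimal Java sketch of that idea using `java.net.HttpURLConnection` from the standard library. The class and method names (`RedirectCappedFetcher`, `fetch`) and the user-agent string are my own placeholders, and the cap of 5 is arbitrary; the point is simply that automatic redirect-following is turned off so a hostile site can't bounce the crawler around indefinitely:

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class RedirectCappedFetcher {
    private static final int MAX_REDIRECTS = 5; // arbitrary cap

    // Follows redirects manually so the crawler decides where it goes,
    // rather than letting the server chain it around without limit.
    public static HttpURLConnection fetch(String startUrl) throws Exception {
        String current = startUrl;
        for (int i = 0; i < MAX_REDIRECTS; i++) {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(current).openConnection();
            conn.setInstanceFollowRedirects(false); // take control ourselves
            conn.setRequestProperty("User-Agent", "MyCrawler/1.0");
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);

            int code = conn.getResponseCode();
            if (code < 300 || code >= 400) {
                return conn; // not a redirect: hand the response back
            }
            String location = conn.getHeaderField("Location");
            if (location == null) {
                throw new IllegalStateException("Redirect without Location header");
            }
            // Resolve relative Location headers against the current URL.
            current = new URL(new URL(current), location).toString();
        }
        throw new IllegalStateException("Too many redirects from " + startUrl);
    }
}
```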
However, there are a few gotchas your crawler could fall for. If it's not smart about what it's doing, it can fall into a spider trap. This is basically an infinite series of pages for your crawler to hit, devised to prevent web crawlers from crawling the site. It's sometimes done unintentionally too, which is why most web crawlers have a max crawl depth setting. (Chris Jester-Young touched on this in the comments, and made a couple of good points about following links that a user can't see, e.g. a link styled with display: none.) A rough depth-limited crawl that also skips such hidden links is sketched below.
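For illustration, here's a sketch of a depth-limited crawl. I'm assuming jsoup as the HTML parser (the original doesn't name one), and MAX_DEPTH of 5 is arbitrary. Note the hidden-link check only catches display:none set in an inline style attribute, not styles applied from a stylesheet, so it's a partial defense at best:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.HashSet;
import java.util.Set;

public class DepthLimitedCrawler {
    private static final int MAX_DEPTH = 5;          // guards against spider traps
    private final Set<String> visited = new HashSet<>();

    public void crawl(String url, int depth) {
        if (depth > MAX_DEPTH || !visited.add(url)) {
            return; // too deep, or we've already seen this URL
        }
        try {
            Document doc = Jsoup.connect(url)
                    .userAgent("MyCrawler/1.0")
                    .get();

            // ... extract whatever text you need from doc here ...

            for (Element link : doc.select("a[href]")) {
                // Skip links hidden with an inline style; this does NOT catch
                // display:none coming from an external stylesheet.
                if (link.attr("style").replace(" ", "").contains("display:none")) {
                    continue;
                }
                crawl(link.absUrl("href"), depth + 1); // absUrl resolves relative hrefs
            }
        } catch (Exception e) {
            // One bad page shouldn't kill the whole crawl.
        }
    }
}
```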
The other thing, obviously, is to be polite. A web crawler eats into a website's bandwidth and resources, so respect robots.txt and rate-limit your requests per host.
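A minimal sketch of per-host rate limiting, again with names (`PolitenessDelay`, `waitForTurn`) and the 2-second delay being my own assumptions. Call it before every fetch; a real crawler would also parse robots.txt and use per-host locks so different hosts don't block each other:

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

public class PolitenessDelay {
    private static final long DELAY_MS = 2000; // at most one request per host every 2s
    private final Map<String, Long> lastRequest = new HashMap<>();

    // Blocks until it's "polite" to hit the URL's host again.
    // synchronized keeps the map consistent, at the cost of
    // serializing waits across hosts in this simple sketch.
    public synchronized void waitForTurn(String url) throws InterruptedException {
        String host = URI.create(url).getHost();
        long wakeAt = lastRequest.getOrDefault(host, 0L) + DELAY_MS;
        long now = System.currentTimeMillis();
        if (now < wakeAt) {
            Thread.sleep(wakeAt - now);
        }
        lastRequest.put(host, System.currentTimeMillis());
    }
}
```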
Last but not least, you can face legal penalties in some countries. Since I'm not a lawyer, I'm not even going to attempt to go into this; look up your local laws/regulations before letting the crawler go.
Upvotes: 2