Reputation: 428
I am writing a site crawler in Java and I was wondering what the most sensible way to run it is. In other words, do I go the standard web app route, deploy it in a web server, and use some kind of message queue, or do I forget about the container and run it as a standalone Java app?
This is not a general-purpose web crawler in that it only cares about x sites, but I want to be constantly cycling through those sites (24 hours a day) to make sure that I have the latest content.
Upvotes: 0
Views: 289
Reputation: 718826
Ask yourself: is there any advantage (to you) in being able to access your web crawler via web requests? If not, there is no reason to put it in a web container.
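If you do run it standalone, something like the following sketch is all the "scheduling infrastructure" you need. This is a minimal illustration, not a complete crawler: the site list, the 1-hour interval, and the `crawl` body are placeholders you would replace with your own.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class StandaloneCrawler {

    // Placeholder list standing in for the fixed set of sites you care about.
    private static final List<String> SITES = List.of(
            "https://example.com",
            "https://example.org");

    public static void main(String[] args) {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        // Re-crawl the whole set at a fixed delay. The 1-hour interval is
        // a placeholder -- tune it to whatever the site owners are happy with.
        scheduler.scheduleWithFixedDelay(() -> {
            for (String site : SITES) {
                crawl(site);
            }
        }, 0, 1, TimeUnit.HOURS);
    }

    private static void crawl(String site) {
        // Hypothetical: fetch the page and store/refresh its content here.
        System.out.println("Crawling " + site);
    }
}
```

Using `scheduleWithFixedDelay` (rather than `scheduleAtFixedRate`) means the next cycle's timer only starts after the previous cycle finishes, so slow sites can't cause overlapping crawl runs.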
... but I want to be constantly cycling through those sites (24 hours) to make sure that I have the latest content.
I hope you have the consent / permission of the site owners to do this. Otherwise, they are likely to take technical or legal measures to stop you doing this.
As Danny Thomas says, your crawler should implement a "robots.txt" handler and respect what those files say when crawling.
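In practice that check could look something like the sketch below. This is deliberately naive (it ignores `Allow` rules, wildcards, per-agent groups, and `Crawl-delay`); for a real crawler, consider a proper robots.txt parsing library such as crawler-commons instead.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RobotsCheck {

    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    // Naive check: fetch /robots.txt and see whether any "Disallow:" rule
    // in the "User-agent: *" group is a prefix of the given path.
    static boolean isAllowed(String baseUrl, String path) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(
                URI.create(baseUrl + "/robots.txt")).build();
        HttpResponse<String> response =
                CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            return true; // no robots.txt: assume crawling is permitted
        }
        boolean inWildcardGroup = false;
        for (String line : response.body().split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.regionMatches(true, 0, "User-agent:", 0, 11)) {
                inWildcardGroup = trimmed.substring(11).trim().equals("*");
            } else if (inWildcardGroup
                    && trimmed.regionMatches(true, 0, "Disallow:", 0, 9)) {
                String rule = trimmed.substring(9).trim();
                if (!rule.isEmpty() && path.startsWith(rule)) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isAllowed("https://example.com", "/some/page"));
    }
}
```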
FOLLOWUP
I may not visit the same page again for at least another 10-15 hours because of the number of sites I need to visit. Is that still generally considered too much crawling?
That's not the right question to ask. The right question to ask is whether the specific site owners would consider that to be too much crawling.
How much is it costing them? Do they need to do extra work to deal with the load caused by your crawling? Do they need to add capacity? Does it increase their running costs? (Network charges, electricity?)
Are you doing something with their content that could reduce their income; e.g. reducing the number of real hits on their site, or the number of advert click-throughs?
What benefit do they gain from your crawling?
Is what you are doing for the public good? (Or is it just a way for you to make a buck out of their content?)
The only way to really know is to ask them.
Upvotes: 1