Reputation: 1
I am currently developing an application that will request some information from websites. What I'm looking to do is parse the HTML pages over an online connection. I was just wondering: will parsing a website put any strain on its server? Will my program have to download any excess information, or will it simply connect to the site as I would through my browser and then scan the source?
If this puts extra strain on a website, then I'm going to have to make a special request to some of the companies whose sites I'm scanning. If not, however, I already have permission to do this.
I hope this made some sort of sense. Kind regards, Jamie.
Upvotes: 0
Views: 1754
Reputation: 21795
Your Java program hitting other people's servers to download the content of a URL won't put any more strain on them than a web browser doing so; essentially they're precisely the same operation. In fact, you'll probably put less strain on them, because your program probably won't bother downloading the images, scripts, etc. that a web browser would.
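To illustrate, here's a minimal sketch of that single request using the standard java.net.http client (Java 11+); the URL is just a placeholder:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FetchHtml {
    public static void main(String[] args) throws Exception {
        // One GET request, exactly what a browser sends for the page itself.
        // Unlike a browser, we never follow up with extra requests for the
        // images, CSS, or scripts referenced in the HTML.
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://example.com/"))  // placeholder URL
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());  // the raw HTML source to parse
    }
}
```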
BUT:
Of course, within some bounds of "non-malicious activity", it's generally perfectly legal for you to make whatever request you want, whenever you want, to whatever server. But equally, that server has the right to serve or deny you that page. So to avoid getting blocked one way or another, you need to either get approval from the server owners or "keep a low profile" in your requests, as in the sketch below.
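One simple way to keep a low profile is to identify your bot honestly in the User-Agent header, so an admin can contact you rather than just block you. A sketch; the bot name and contact address are made up:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class IdentifiedFetch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://example.com/"))  // placeholder URL
                // An honest User-Agent (name and address are hypothetical)
                // lets the admin reach you or whitelist you instead of just
                // blocking the traffic outright.
                .header("User-Agent", "MyParserBot/1.0 (contact: jamie@example.com)")
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
    }
}
```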
Upvotes: 0
Reputation: 14439
You could use HtmlUnit. It gives you a virtual, GUI-less browser.
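A minimal sketch of fetching a page that way (the package names are from the classic com.gargoylesoftware releases and have changed in newer versions; the URL is a placeholder):

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitExample {
    public static void main(String[] args) throws Exception {
        // WebClient is the headless "browser"; try-with-resources closes it.
        try (WebClient webClient = new WebClient()) {
            // Skip JavaScript execution to keep the request lightweight.
            webClient.getOptions().setJavaScriptEnabled(false);
            HtmlPage page = webClient.getPage("http://example.com/");  // placeholder
            System.out.println(page.getTitleText());
            System.out.println(page.asXml());  // the parsed HTML as a DOM dump
        }
    }
}
```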
Upvotes: 0
Reputation: 13564
Besides Hank Gay's recommendation, I can only suggest reusing an existing open-source HTML parser, such as Jsoup, for parsing/processing the downloaded HTML files.
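For example, fetching and parsing a page with Jsoup takes only a few lines (a sketch; the URL and selector are placeholders):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupExample {
    public static void main(String[] args) throws Exception {
        // Fetch and parse in one step.
        Document doc = Jsoup.connect("http://example.com/").get();  // placeholder URL
        System.out.println(doc.title());
        // CSS-style selectors pull out whatever you need, e.g. all links:
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("abs:href") + " -> " + link.text());
        }
    }
}
```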
Upvotes: 0
Reputation: 34150
It depends on the website. If you do this to Google, then most likely you will be blocked for a day. If you parse Wikipedia (which I have done myself), it won't be a problem, because it's already a huge, huge website.
If you want to do it the right way: first respect robots.txt, then try to scatter your requests over time. Also try to do it when traffic is low, like around midnight, not at 8 AM or 6 PM when people get to their computers. Something like the sketch below.
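A sketch of the scattering part; the URLs are placeholders and the 5-15 second interval is an arbitrary choice:

```java
import java.util.List;
import java.util.Random;

public class ScatteredRequests {
    public static void main(String[] args) throws Exception {
        // Placeholder URLs; substitute the pages you actually need.
        List<String> urls = List.of("http://example.com/1", "http://example.com/2");
        Random random = new Random();
        for (String url : urls) {
            System.out.println("fetching " + url);
            // ... fetch-and-parse step goes here ...
            // Sleep 5-15 seconds so requests trickle in rather than burst.
            Thread.sleep(5_000 + random.nextInt(10_000));
        }
    }
}
```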
Upvotes: 0
Reputation: 18601
No extra strain on other people's servers. The server just gets your plain HTTP GET request; it won't even be aware that you then parse the page/HTML.
Have you checked out Jsoup?
Upvotes: 2
Reputation: 71939
Consider doing the parsing and the crawling/scraping in separate steps. If you do that, you can probably use an existing open-source crawler such as crawler4j, which already has support for politeness delays, robots.txt, etc. If you just blindly go grabbing content from somebody's site with a bot, the odds are good that you're going to get banned (or worse, if the admin is feeling particularly vindictive or creative that day).
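A rough sketch of a polite crawler4j setup (assuming the 4.x API; the storage folder, seed URL, and MyCrawler behaviour are placeholders):

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class PoliteCrawl {
    // Minimal crawler: decide which URLs to visit and what to do with each page.
    public static class MyCrawler extends WebCrawler {
        @Override
        public boolean shouldVisit(Page referringPage, WebURL url) {
            return url.getURL().startsWith("http://example.com/");  // placeholder
        }

        @Override
        public void visit(Page page) {
            System.out.println("Visited: " + page.getWebURL().getURL());
        }
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl");  // placeholder path
        config.setPolitenessDelay(1000);             // ms between requests to a host

        PageFetcher pageFetcher = new PageFetcher(config);
        // robots.txt handling comes built in.
        RobotstxtServer robotstxtServer =
                new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        CrawlController controller =
                new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("http://example.com/");   // placeholder seed
        controller.start(MyCrawler.class, 1);        // one crawler thread
    }
}
```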
Upvotes: 1