Jamie McLeish
Jamie McLeish

Reputation: 1

Parsing a HTML file in Java

I am currently in the process of developing an application that will request some information from Websites. What I'm looking to do is parse the HTML files through a connection online. I was just wondering, by parsing the Website will it put any strain on the server, will it have to download any excess information or will it simply connect to the site as I would do through my browser and then scan the source?

If this is putting extra strain on the Website then I'm going to have to make a special request to some of the companies I'm scanning. However if not then I have the permission to do this.

I hope this made some sort of sense. Kind regards, Jamie.

Upvotes: 0

Views: 1754

Answers (6)

Neil Coffey
Neil Coffey

Reputation: 21795

Your Java program hitting other people's server to download the content of a URL won't put any more strain on the server than a web browser doing so-- essentially they're precisely the same operation. In fact, you probably put less strain on them, because your program probably won't be bothered about downloading images, scripts etc that a web browser would.

BUT:

  • if you start bombarding a server of a company with moderate resources with downloads or start exhibiting obvious "robot" patterns (e.g. downloading precisely every second), they'll probably block you; so put some sensible constraints on what you do (e.g. every consecutive download to the same server happens at random intervals of between 10 and 20 seconds);
  • when you make your request, you probably want to set the "referer" request header either to mimic an actual browser, or to be open about what it is (invent a name for your "robot", create a page explaining what it does and include a URL to that page in the referer header)-- many server owners will let through legitimate, well-behaved robots, but block "suspicious" ones where it's not clear what they're doing;
  • on a similar note, if you're doing things "legally", don't fetch pages that the site's "robot.txt" files prohibits you from fetching.

Of course, within some bounds of "non-malicious activity", in general it's perfectly legal for you to make whatever request you want whenever you want to whatever server. But equally, that server has a right to serve or deny you that page. So to prevent yourself from being blocked, one way or another, you need to either get approval from the server owners, or "keep a low profile" in your requests.

Upvotes: 0

michael nesterenko
michael nesterenko

Reputation: 14439

You could use htmlunit. It gives you virtual gui less browser.

Upvotes: 0

MarcoS
MarcoS

Reputation: 13564

Besides Hank Gay's recommendation, I can only suggest that you can also re-use some open-source HTML parser, such as Jsoup, for parsing/processing the downloaded HTML files.

Upvotes: 0

Amir Raminfar
Amir Raminfar

Reputation: 34150

Depends on the website. If you do this to Google then most likely you will be on a hold for a day. If you parse Wikipedia, (which I have done myself) it won't be a problem because its already a huge, huge website.

If you want to do it the right way, first respect robots.txt, then try to scatter your requests. Also try to do it when the traffic is low. Like around midnight and not at 8AM or 6PM when people get to computers.

Upvotes: 0

Marsellus Wallace
Marsellus Wallace

Reputation: 18601

No extra strain on other people servers. The server will get your simple HTML GET request, it won't even be aware that you're then parsing the page/html.

Have you checked this: JSoup?

Upvotes: 2

Hank Gay
Hank Gay

Reputation: 71939

Consider doing the parsing and the crawling/scraping in separate steps. If you do that, you can probably use an existing open-source crawler such as crawler4j that already has support for politeness delays, robots.txt, etc. If you just blindly go grabbing content from somebody's site with a bot, the odds are good that you're going to get banned (or worse, if the admin is feeling particularly vindictive or creative that day).

Upvotes: 1

Related Questions