baribari

Reputation: 1

How to get only a part of an HTML page?

What would be the best way to extract one part of an HTML page that I fetched with Apache HttpClient 4 in Java? Specifically, I need a table (its contents).
An explanation, an example, or a link would be great.

Upvotes: 0

Views: 387

Answers (2)

Alex Marshall

Reputation: 10312

Adrian Rodriguez's approach isn't bad, but unfortunately it will only work if the HTML is XHTML (i.e. well-formed XML). Alternatively, you can use a library called Web Harvest (available on sourceforge.net) to scrape the page and extract the table declaratively rather than writing code to do it. The scraping script can also include steps for sanitizing the page as needed. I'd strongly recommend it as a much more robust solution for what you want, especially if you are going to need to scrape other pages in the future.
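
To see the XHTML limitation for yourself, here is a minimal sketch that feeds two snippets to the standard JDK XML parser. The class name WellFormedCheck and the method parsesAsXml are made up for this illustration; the sample markup is an assumption, not taken from any real page.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, for illustration only.
class WellFormedCheck {

    /** Returns true if the JDK XML parser accepts the markup, false if it rejects it. */
    static boolean parsesAsXml(String markup) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // The markup is not well-formed XML.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // XHTML style: every tag is closed.
        System.out.println(parsesAsXml("<p>one<br/>two</p>")); // prints "true"
        // Ordinary HTML: the unclosed <br> breaks the XML parser.
        System.out.println(parsesAsXml("<p>one<br>two</p>"));  // prints "false"
    }
}
```

Plenty of real-world pages look like the second snippet, which is why some cleanup step is usually needed before XML tooling can be used on them.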

Upvotes: 1

Adrian Rodriguez

Reputation: 3242

You could parse the response into a DOM document, which works as long as the page is well-formed XML (XHTML).

Something like this:

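DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
// parse() throws SAXException if the response is not well-formed XML
Document document = builder.parse(/* your input stream from response */);
// getElementById() returns null unless "id" is declared as an ID attribute (e.g. by the page's DTD)
Element tableElement = document.getElementById("the-table-id");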
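
For a fuller picture, here is a self-contained sketch of the same idea that also reads the cell contents. It is only an illustration under some assumptions: the class TableExtractor and the method extractTable are hypothetical names, the sample page is made up, and the input must be well-formed XML. It looks the table up by comparing the plain id attribute, so it does not depend on a DTD the way getElementById does.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, for illustration only.
class TableExtractor {

    /** Returns the cell texts of the table with the given id, one list per row. */
    static List<List<String>> extractTable(InputStream in, String tableId) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document document = builder.parse(in); // fails on markup that is not well-formed

        // Find the table by comparing the "id" attribute directly.
        Element table = null;
        NodeList tables = document.getElementsByTagName("table");
        for (int i = 0; i < tables.getLength(); i++) {
            Element candidate = (Element) tables.item(i);
            if (tableId.equals(candidate.getAttribute("id"))) {
                table = candidate;
                break;
            }
        }

        List<List<String>> rows = new ArrayList<>();
        if (table == null) {
            return rows; // no such table: return an empty result
        }

        // Note: this also picks up rows of tables nested inside the target table.
        NodeList trs = table.getElementsByTagName("tr");
        for (int i = 0; i < trs.getLength(); i++) {
            List<String> row = new ArrayList<>();
            NodeList children = trs.item(i).getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                String name = child.getNodeName();
                if (child instanceof Element
                        && (name.equalsIgnoreCase("td") || name.equalsIgnoreCase("th"))) {
                    row.add(child.getTextContent().trim());
                }
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample page; in practice the stream would come from the HTTP response.
        String page = "<html><body><p>intro</p>"
                + "<table id=\"the-table-id\">"
                + "<tr><th>Name</th><th>Qty</th></tr>"
                + "<tr><td>apple</td><td>3</td></tr>"
                + "</table></body></html>";
        InputStream in = new ByteArrayInputStream(page.getBytes(StandardCharsets.UTF_8));
        System.out.println(extractTable(in, "the-table-id")); // prints "[[Name, Qty], [apple, 3]]"
    }
}
```

With HttpClient 4 the input stream would typically come from the response entity's getContent(); that wiring is left out here to keep the sketch runnable without extra libraries.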

Upvotes: 2
