Reputation: 1
What would be the best way to single out a part of an HTML page that I obtained with a request using Apache HttpClient 4 and Java? Specifically, I need a table (its contents).
An explanation, example, or link would be great.
Upvotes: 0
Views: 387
Reputation: 10312
Adrian Rodriguez's way isn't bad, but unfortunately it'll only work if the HTML is XHTML (i.e. well-formed XML). You can use a library called Web Harvest (available on sourceforge.net) to scrape the page and extract the table declaratively rather than writing code to do it. It also includes phases in the build script for sanitizing the page as needed. I'd strongly recommend it, as it would be a much more robust solution for what you want, especially if you're going to need to scrape other pages in the future.
Upvotes: 1
Reputation: 3242
What you could do is build a DOM object from the response, provided the response is a well-formed XML (XHTML) document.
Do something like:
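import javax.xml.parsers.*;
import org.w3c.dom.*;
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse(/* your input stream from response */);
// Note: getElementById only matches attributes declared as type ID (e.g. by a DTD); otherwise it returns null.
Element tableElement = document.getElementById("the-table-id");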
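A slightly fuller sketch of the same DOM idea, assuming the response body is well-formed XHTML. The class name `TableExtractor`, the method `extractTableCells`, and the sample markup are hypothetical, made up for illustration. Instead of `getElementById`, it scans the `table` elements and compares their `id` attribute directly, which also works when no DTD declares `id` as an ID-typed attribute.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class TableExtractor {

    /**
     * Returns the trimmed text of every td cell, row by row, for the table
     * whose id attribute equals tableId. Returns an empty list if no such
     * table exists. Assumes the input is well-formed XML and that the table
     * contains no nested tables.
     */
    public static List<List<String>> extractTableCells(InputStream xhtml, String tableId)
            throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document document = builder.parse(xhtml);

        // Find the table by comparing the plain "id" attribute value.
        Element table = null;
        NodeList tables = document.getElementsByTagName("table");
        for (int i = 0; i < tables.getLength(); i++) {
            Element candidate = (Element) tables.item(i);
            if (tableId.equals(candidate.getAttribute("id"))) {
                table = candidate;
                break;
            }
        }

        List<List<String>> rows = new ArrayList<>();
        if (table == null) {
            return rows;
        }

        // Collect the text of each cell, one inner list per table row.
        NodeList trs = table.getElementsByTagName("tr");
        for (int r = 0; r < trs.getLength(); r++) {
            NodeList tds = ((Element) trs.item(r)).getElementsByTagName("td");
            List<String> cells = new ArrayList<>();
            for (int c = 0; c < tds.getLength(); c++) {
                cells.add(tds.item(c).getTextContent().trim());
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample page standing in for the HttpClient response body.
        String page = "<html><body><p>intro</p>"
                + "<table id=\"the-table-id\">"
                + "<tr><td>a</td><td>b</td></tr>"
                + "<tr><td>c</td><td>d</td></tr>"
                + "</table></body></html>";
        InputStream in = new ByteArrayInputStream(page.getBytes(StandardCharsets.UTF_8));
        System.out.println(extractTableCells(in, "the-table-id")); // prints [[a, b], [c, d]]
    }
}
```

With a real response, the `InputStream` would come from the response entity rather than from a string. If the page is not well-formed XML, `builder.parse` will throw, so the markup would need to be cleaned up first.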
Upvotes: 2