Reputation: 15475
I'm just getting started with HtmlUnit. I want to take a web page and extract its raw text, without any of the HTML markup.
Can HtmlUnit do that? If so, how? Or is there another library I should be looking at?
For example, if the page contains:
<body><p>para1 test info</p><div><p>more stuff here</p></div>
I'd like it to output:
para1 test info more stuff here
Thanks.
Upvotes: 5
Views: 3968
Reputation: 2197
The getting-started guide at http://htmlunit.sourceforge.net/gettingStarted.html shows that this is possible. Its first example is:
    @Test
    public void homePage() throws Exception {
        // Load the page with a default (simulated) browser.
        final WebClient webClient = new WebClient();
        final HtmlPage page = webClient.getPage("http://htmlunit.sourceforge.net");
        assertEquals("HtmlUnit - Welcome to HtmlUnit", page.getTitleText());

        // asXml() returns the page's markup.
        final String pageAsXml = page.asXml();
        assertTrue(pageAsXml.contains("<body class=\"composite\">"));

        // asText() returns the page's text content with the markup removed.
        final String pageAsText = page.asText();
        assertTrue(pageAsText.contains("Support for the HTTP and HTTPS protocols"));
    }
NB: the page.asText() method seems to offer exactly what you are after.
See the Javadoc for asText(), which HtmlPage inherits from DomNode.
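On the "another library" part of the question: if adding a dependency is not an option, the HTML parser bundled with the JDK (javax.swing.text.html) can do a rough version of the same thing. The following is only a minimal sketch and is not HtmlUnit. The class name HtmlTextExtractor and the method extractText are made up for illustration, and the whitespace handling (joining text runs with single spaces) is an assumption based on the desired output in the question. The Swing parser is old and lenient, so it may handle modern pages less faithfully than HtmlUnit does.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper (not part of HtmlUnit): strips markup using the JDK's built-in HTML parser.
public class HtmlTextExtractor {

    // Returns the text runs found in the given HTML, joined by single spaces.
    public static String extractText(String html) {
        final StringBuilder text = new StringBuilder();

        // The callback receives each run of character data found between tags.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data).append(' ');
            }
        };

        try {
            // The third argument tells the parser to ignore any charset declared in the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        // Collapse runs of whitespace and drop the trailing separator.
        return text.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<body><p>para1 test info</p><div><p>more stuff here</p></div>";
        System.out.println(extractText(html));
    }
}
```

For fetching live pages, or pages whose text is produced by JavaScript, HtmlUnit's page.asText() remains the simpler route, since the sketch above only parses a string you already have.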
Upvotes: 5