How to extract the text without HTML tags out of a webpage using HtmlUnit?

Question

I'm just getting started with HtmlUnit. I want to take a webpage and extract its raw text, without any of the HTML markup.

Can HtmlUnit accomplish that? If so, how? Or is there another library I should be looking at?

For example, if the page contains:

para1 test info
more stuff here

I'd like it to output:

para1 test info more stuff here

Thanks.

Syntax · Accepted Answer

The HtmlUnit getting-started guide (http://htmlunit.sourceforge.net/gettingStarted.html) shows that this is possible:

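// Imports for JUnit 4 and HtmlUnit 2.x package names
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import org.junit.Test;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

@Test
public void homePage() throws Exception {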
    final WebClient webClient = new WebClient();
    final HtmlPage page = webClient.getPage("http://htmlunit.sourceforge.net");
    assertEquals("HtmlUnit - Welcome to HtmlUnit", page.getTitleText());

    final String pageAsXml = page.asXml();
    assertTrue(pageAsXml.contains("<body class=\"composite\">"));

    final String pageAsText = page.asText();
    assertTrue(pageAsText.contains("Support for the HTTP and HTTPS protocols"));
}

NB: the page.asText() method seems to offer exactly what you are after: it returns a textual representation of the page rather than its markup.

Javadoc for asText (inherited by HtmlPage from DomNode)
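
The question asks for the text on a single line, whereas page.asText() may well put each block element on its own line. Below is a minimal sketch of a post-processing step that does not depend on HtmlUnit itself. The class TextFlattener and the method collapseWhitespace are hypothetical names invented for this illustration. The hard-coded string stands in for whatever page.asText() returns, and the line break inside it is an assumption about that output.

```java
public class TextFlattener {

    // Hypothetical helper (not part of HtmlUnit): replaces every run of
    // whitespace, including line breaks, with a single space and trims
    // leading and trailing whitespace.
    public static String collapseWhitespace(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Stand-in for the result of page.asText(); the newline between
        // the two blocks is assumed, not taken from a real page.
        String pageAsText = "para1 test info\nmore stuff here\n";

        // Prints: para1 test info more stuff here
        System.out.println(collapseWhitespace(pageAsText));
    }
}
```

In the test above, the same helper could be applied as collapseWhitespace(page.asText()) before comparing against the expected single-line string.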
