Peter

Reputation: 857

HtmlUnit - Convert an HtmlPage into HTML string?

I'm using HtmlUnit to generate the HTML for various pages, but so far the closest I can get to the raw HTML that the server returns is to convert the HtmlPage into an XML string.

This is somewhat annoying because web browsers render the XML output differently from the raw HTML. Is there a way to convert an HtmlPage into raw HTML instead of XML?

Thanks!

Upvotes: 8

Views: 13228

Answers (6)

Pavlo

Reputation: 1

Here is my solution that works for me:

ScriptResult scriptResult = htmlPage.executeJavaScript("document.documentElement.outerHTML;");
System.out.println(scriptResult.getJavaScriptResult().toString());

Upvotes: 0

snorbi

Reputation: 2890

I think there is no direct way to get the final page as HTML. asXml() returns the result as XML, asText() returns the extracted text content.

The best you can do is to use asXml() and "transform" it to HTML:

htmlPage.asXml().replaceFirst("<\\?xml version=\"1.0\" encoding=\"(.+)\"\\?>", "<!DOCTYPE html>")

(Of course you can apply more transformations like converting <br/> to <br> - it depends on your requirements.)
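As a rough illustration of such extra transformations, here is a minimal sketch that works on the string returned by asXml(). The class name XmlToHtml and the method toHtml are hypothetical, not part of HtmlUnit, and the regex approach is an assumption that only holds for simple markup (it can be fooled by "/>" inside attribute values):

```java
import java.util.regex.Pattern;

public class XmlToHtml {
    // HTML void elements: in HTML syntax they are written without a closing slash.
    private static final Pattern VOID_SELF_CLOSED = Pattern.compile(
        "<(area|base|br|col|embed|hr|img|input|link|meta|source|track|wbr)\\b([^>]*?)\\s*/>",
        Pattern.CASE_INSENSITIVE);

    // A leading XML declaration such as <?xml version="1.0" encoding="UTF-8"?>
    private static final Pattern XML_DECLARATION =
        Pattern.compile("^\\s*<\\?xml[^>]*\\?>\\s*");

    /** Hypothetical helper: makes asXml()-style output look more like HTML. */
    public static String toHtml(String xml) {
        // Swap the XML declaration for an HTML doctype (no-op if there is none).
        String html = XML_DECLARATION.matcher(xml).replaceFirst("<!DOCTYPE html>\n");
        // Rewrite <br/> as <br>, <img ... /> as <img ...>, and so on.
        return VOID_SELF_CLOSED.matcher(html).replaceAll("<$1$2>");
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
            + "<html><body>a<br/>b<img src=\"x.png\"/></body></html>";
        System.out.println(toHtml(xml));
    }
}
```

In practice you would call it as XmlToHtml.toHtml(htmlPage.asXml()). Non-void elements that asXml() writes in self-closed form are left untouched by this sketch.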

Even the related Google documentation recommends this approach (although they don't apply any transformations):

// return the snapshot
out.println(page.asXml());

Upvotes: 1

PooBucket

Reputation: 63

Maybe you want to go with something like this, instead of using the HtmlUnit framework's methods:

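// Requires: java.io.BufferedReader, java.io.IOException, java.io.InputStreamReader,
//           java.nio.charset.StandardCharsets
// Reads the raw response body straight from the URL, bypassing HtmlUnit (no JavaScript runs).
// Assumption: the page is UTF-8 encoded; adjust the charset if it is not.
try (BufferedReader br = new BufferedReader(
        new InputStreamReader(url.openConnection().getInputStream(), StandardCharsets.UTF_8))) {

    // StringBuilder avoids the quadratic cost of String += inside the loop.
    StringBuilder htmlSource = new StringBuilder();
    String line;
    while ((line = br.readLine()) != null) {
        htmlSource.append(line).append("\n");
    }
    return htmlSource.toString();

} catch (IOException e) {
    // Handle or rethrow as appropriate; this only logs the failure.
    e.printStackTrace();
}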
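On Java 9 or later, the same read can be written more compactly with InputStream.readAllBytes(). This is only a sketch: the class name RawSource and the method readAll are hypothetical, and the caller has to know the charset of the response.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RawSource {
    /** Hypothetical helper: reads the whole stream and decodes it with the given charset. */
    public static String readAll(InputStream in, Charset charset) {
        try (InputStream stream = in) {
            // readAllBytes() exists since Java 9 and reads until end of stream.
            return new String(stream.readAllBytes(), charset);
        } catch (IOException e) {
            // Wrapped so callers are not forced to handle a checked exception.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for url.openConnection().getInputStream() so the demo needs no network.
        InputStream fake = new ByteArrayInputStream(
            "<html><br></html>".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(fake, StandardCharsets.UTF_8));
    }
}
```

With a real URL you would pass url.openConnection().getInputStream() as the first argument. As with the loop above, this returns the bytes the server sent, not the DOM after scripts have run.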

Upvotes: 0

mP.

Reputation: 18266

I don't know of an answer short of switching on the Page type; for XmlPage and SgmlPage, one must take the innerHTML of the HTML element and manually write out its attributes. It is neither elegant nor exact (it's missing the doctype), but it works.

page.getWebResponse().getContentAsString()

This is incorrect, as it returns the text form of the original, unrendered bytes, before any JavaScript has run. If JavaScript executes and changes the page, this method will not see the changes.

page.asXml() will return the HTML. page.asText() returns it rendered down to just text.

I just want to confirm that asText() only returns the text within text nodes and does not include the tags or their attributes. If you want the complete HTML, it is not good enough.

Upvotes: 0

Rodney Gitzel

Reputation: 2710

page.asXml() will return the HTML. page.asText() returns it rendered down to just text.

Upvotes: 12

Sergey O.

Reputation: 101

I'm not 100% certain I understood the question correctly, but maybe this will address your issue:

page.getWebResponse().getContentAsString()

Upvotes: 6
