Laura

Reputation: 181

Getting pure HTML content from a web page

I'm trying to retrieve the WYSIWYG HTML content of a web page (generated with Apache Wicket, though I don't think that matters). I tried the solutions described here, but I always get an HTML body like the following:

<body>
    <div
    style="width: 830px; height: 300px; margin: auto; margin-top: 50px;">
        <div wicket:id="rangeBar"
        style="float: left; width: 400px; height: 300px; margin-right: 30px;"
        id="rangeBar1"></div>
    </div>
</body>

I was expecting to retrieve markup similar to what I see in the browser's web console, like:

<body>
    <div style="width: 830px; height: 300px; margin: auto; margin-top: 50px;">
        <div wicket:id="rangeBar" style="float: left; width: 400px; height: 300px; margin-right: 30px;" id="rangeBar1" class="shield-chart">
            <div id="shielddw" class="shield-container" style="position: relative; overflow: hidden; width: 400px; height: 300px; line-height: normal; z-index: 0; font-family: 'Segoe UI', Tahoma, Verdana, sans-serif; font-size: 12px;">
                <svg xmlns="http://www.w3.org/2000/svg" version="1.1" width="400" height="300">
                    <defs>
                    <clippath id="shielddx">
                    <rect rx="0" ry="0" fill="none" x="0" y="0" width="9999" height="300" stroke-width="0.000001"></rect></clippath>
                    <clippath id="shielddy">
                    <rect fill="none" x="0" y="0" width="331" height="210"></rect></clippath>
                    <filter id="a5a87bf2-0ea3-4664-8ceb-bd50b883a117" height="120%">
                    <fegaussianblur in="SourceAlpha" stdDeviation="3"></fegaussianblur>
                    <fecomponenttransfer>
                    <fefunca type="linear" slope="0.2"></fefunca></fecomponenttransfer>
                    <femerge>
                    <femergenode></femergenode>
                    <femergenode in="SourceGraphic"></femergenode></femerge></filter></defs>
                    <rect rx="0" ry="0" fill="#2D2D2D" x="0" y="0" width="400"
                    height="300" stroke-width="0.000001"></rect>  
                      ..... 
                 </svg>
            </div>
            <div class="shield-tooltip" style="pointer-events: none"></div>
        </div>
    </div>
</body>

Is there any way to get such content in Java?

Thanks, Laura

UPDATE: Here is my Java code:

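// Requires Apache HttpClient 4.3+ on the classpath.
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Date;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;

// TEST_WEB_PAGE and ROOT are constants defined elsewhere in the class.
// Epoch millis keep the file name free of the spaces and colons that new Date().toString() produces.
File htmlFile = new File(ROOT + "etc/html/demo_apache_" + new Date().getTime() + ".html");
// try-with-resources closes the client, the response and both streams, even on failure.
try (CloseableHttpClient httpclient = HttpClientBuilder.create().build();
     CloseableHttpResponse response = httpclient.execute(new HttpGet(TEST_WEB_PAGE));
     InputStream content = response.getEntity().getContent();
     OutputStream htmlStream = new FileOutputStream(htmlFile)) {
    // Copy the response body to the file in 8 KB chunks.
    byte[] buffer = new byte[8 * 1024];
    int bytesRead;
    while ((bytesRead = content.read(buffer)) != -1) {
        htmlStream.write(buffer, 0, bytesRead);
    }
}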

Upvotes: 2

Views: 490

Answers (1)

Aidy J

Reputation: 305

Is there any JavaScript included in the head tag that might be populating the div after the page has loaded?

If you obtain the page programmatically with a plain HTTP client in Java, this JavaScript will not be executed, so you only get the markup the server sent.
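To illustrate the point, here is a minimal, self-contained sketch using only the JDK (11 or later). It is not the asker's setup: the page, the class name RawHtmlDemo and the method fetchRaw are hypothetical stand-ins. It serves a small page whose script would fill the div in a browser, fetches it with an HTTP client, and shows that the fetched markup still contains the empty div.

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class RawHtmlDemo {
    // Hypothetical stand-in for the real page: the div is empty in the
    // served markup, and a script would populate it inside a browser.
    static final String PAGE =
        "<html><head><script>"
        + "window.onload = function () {"
        + "  var d = document.getElementById('rangeBar1');"
        + "  d.className = 'shield-chart';"
        + "  d.appendChild(document.createElement('svg'));"
        + "};"
        + "</script></head>"
        + "<body><div id=\"rangeBar1\"></div></body></html>";

    // Starts a throwaway local server, fetches the page once, and
    // returns the response body exactly as the server sent it.
    static String fetchRaw() throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            byte[] body = PAGE.getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "text/html; charset=utf-8");
            exchange.sendResponseHeaders(200, body.length);
            try (var out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        try {
            URI uri = URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/");
            HttpClient client = HttpClient.newHttpClient();
            HttpResponse<String> response = client.send(
                HttpRequest.newBuilder(uri).GET().build(),
                HttpResponse.BodyHandlers.ofString());
            return response.body();
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws Exception {
        String html = fetchRaw();
        // The empty div arrives unchanged; the class the script would add is absent.
        System.out.println("empty div present: " + html.contains("<div id=\"rangeBar1\"></div>"));
        System.out.println("rendered class present: " + html.contains("class=\"shield-chart\""));
    }
}
```

The HTTP client never runs the script, so the response is the server's markup and nothing more. Getting the markup shown in the browser console would need something that actually executes JavaScript and then serializes the resulting DOM, such as a headless browser driven from Java.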

Upvotes: 3
