PearHead

Reputation: 33

How to Download File from Javascript Link in HTMLUnit

As the title says, I am trying to download a file with HTMLUnit from a javascript link.

The page I am starting at is https://ppair.uspto.gov/TruePassWebStart/AuthenticationChooser.html. When I click on the link "Authenticate with Java Web Start (new method)" in a browser, a .jnlp file is downloaded, which then runs and opens a Java program window that asks for authentication information. Once authentication is successful, the original browser window loads the page with the information I will be scraping.

The link source code snippet from the starting page is:

<tr>
<!-- onClick="return launchWebStart('authenticate');" -->
    <td><a href="javascript:void(0)" id="webstart-authenticate" ><font size="5">Authenticate with Java Web Start (new method)</font></a>
</tr>

The JavaScript file used for this process is found at https://ppair.uspto.gov/TruePassWebStart/js/WebStart.js. Basically, the JavaScript takes a cookie, encodes it, and appends it to a URL to request the jnlp file. I thought about just emulating this process, but the HtmlUnit documentation discourages this. (It says "It is much better to manipulate the page as a user would by clicking on elements and shifting the focus around", so I am trying to do so.)

The problem I am having in HtmlUnit is that after I click() the appropriate anchor, I do not receive the expected jnlp file. I've tried several things that I found in other questions on this site, including:

"HtmlUnit and JavaScript in links" and "HtmlUnit to invoke javascript from href to download a file"

Here is the code I used:

import java.io.IOException;
import java.io.InputStream;
import java.net.MalformedURLException;

import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class Test {

    public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
        WebClient webClient = new WebClient(BrowserVersion.FIREFOX_45);

        // open starting webpage
        HtmlPage page = webClient.getPage("https://ppair.uspto.gov/TruePassWebStart/AuthenticationChooser.html");

        // id of the element where the link is
        String linkID = "webstart-authenticate";

        // identify the appropriate anchor
        HtmlAnchor anchor = (HtmlAnchor) page.getElementById(linkID);

        // click the anchor
        Page p = anchor.click();

        // get the InputStream for the response; print it out
        InputStream is = p.getWebResponse().getContentAsStream();
        int b = 0;
        while ((b = is.read()) != -1) {
            System.out.print((char)b);
        }
        webClient.close();
    }
}

What is printed out from the code above is the html from the starting webpage rather than the expected jnlp file. The console also prints out status updates from the javascript WebConsole every 3 seconds (at least if I have the code wait long enough), so I know that something is happening with the javascript (the functions launchWebStart and followMediator are in the separate javascript file WebStart.js):

Nov 21, 2016 2:53:25 PM com.gargoylesoftware.htmlunit.WebConsole info
INFO: launchWebStart

Nov 21, 2016 2:53:25 PM com.gargoylesoftware.htmlunit.WebConsole info
INFO: followMediator

Nov 21, 2016 2:53:25 PM com.gargoylesoftware.htmlunit.WebConsole info
INFO: responseReceived:200
WAIT

Nov 21, 2016 2:53:25 PM com.gargoylesoftware.htmlunit.WebConsole info
INFO: mediatorCallback: next wait

I also tried using a CollectingAttachmentHandler object as described in "downloading files behind javascript button with htmlunit":

import java.io.IOException;
import java.net.MalformedURLException;
import java.util.List;

import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.attachment.Attachment;
import com.gargoylesoftware.htmlunit.attachment.CollectingAttachmentHandler;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class Test2 {

    public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
        WebClient webClient = new WebClient(BrowserVersion.FIREFOX_45);

        // open starting webpage
        HtmlPage page = webClient.getPage("https://ppair.uspto.gov/TruePassWebStart/AuthenticationChooser.html");

        // id of the element where the link is
        String linkID = "webstart-authenticate";

        // identify the appropriate anchor
        HtmlAnchor anchor = (HtmlAnchor) page.getElementById(linkID);

        CollectingAttachmentHandler attachmentHandler = new CollectingAttachmentHandler();
        webClient.setAttachmentHandler(attachmentHandler);
        attachmentHandler.handleAttachment(anchor.click());
        List<Attachment> attachments = attachmentHandler.getCollectedAttachments();

        int i = 0;
        while (i < attachments.size()) {
            Attachment attachment = attachments.get(i);
            Page attachedPage = attachment.getPage();
            WebResponse attachmentResponse = attachedPage.getWebResponse();
            String content = attachmentResponse.getContentAsString();
            System.out.println(content);
            i++;
        }
        webClient.close();
    }
}

This code also prints out the content of the starting webpage. So none of the other solutions seems to work for me. I can't figure out what I'm doing wrong. I'm running out of ideas about how I might get this to work (and I thought it would be easy!) Any advice is greatly appreciated!

Upvotes: 3

Views: 1901

Answers (1)

qfrank

Reputation: 272

Here is a working version based on your Test2:

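    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.concurrent.CountDownLatch;

    import org.apache.commons.io.IOUtils;

    import com.gargoylesoftware.htmlunit.*;
    import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;
    import com.gargoylesoftware.htmlunit.webstart.WebStartHandler;

    public class Test3 {

        public static void main(String[] args) throws IOException, InterruptedException {
            WebClient webClient = new WebClient(BrowserVersion.FIREFOX_45);

            // open starting webpage
            HtmlPage page = webClient.getPage("https://ppair.uspto.gov/TruePassWebStart/AuthenticationChooser.html");

            // id of the element where the link is
            String linkID = "webstart-authenticate";

            // identify the appropriate anchor
            HtmlAnchor anchor = (HtmlAnchor) page.getElementById(linkID);

            // released by the handler once the .jnlp response has been saved
            CountDownLatch latch = new CountDownLatch(1);
            webClient.setWebStartHandler(new WebStartHandler(){

                @Override
                public void handleJnlpResponse(WebResponse webResponse)
                {
                    System.out.println("downloading...");
                    // write the .jnlp response body to a local file
                    try (FileOutputStream fos = new FileOutputStream("/Users/Franklyn/Downloads/uspto-auth.authenticate2.jnlp"))
                    {
                        IOUtils.copy(webResponse.getContentAsStream(),fos);
                    } catch (IOException e)
                    {
                        throw new RuntimeException(e);
                    }
                    System.out.println("downloaded");
                    latch.countDown();
                }
            });

            anchor.click();
            latch.await(); // wait for the download to finish

            webClient.close();
        }
    }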
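
One caveat about the latch.await() call above: with no timeout it blocks forever if the handler is never invoked, for example if the site never returns the .jnlp response. A minimal sketch of a bounded wait follows, using the standard CountDownLatch.await(long, TimeUnit). The BoundedWait class, the awaitDownload helper, and the 30-second limit are hypothetical names and values chosen for illustration, and the thread in main only stands in for the handler callback.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical helper class, not part of HtmlUnit.
final class BoundedWait {

    // Waits until countDown() has been called on the latch or the timeout elapses.
    // Returns true if the latch reached zero in time, false on timeout or interruption.
    static boolean awaitDownload(CountDownLatch latch, long timeout, TimeUnit unit) {
        try {
            return latch.await(timeout, unit);
        } catch (InterruptedException e) {
            // restore the interrupt flag so callers can still observe it
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        CountDownLatch latch = new CountDownLatch(1);

        // Stand-in for the handler callback: counts down when the "download" is done.
        Thread fakeHandler = new Thread(latch::countDown);
        fakeHandler.start();

        // 30 seconds is an arbitrary limit chosen for this sketch.
        boolean finished = awaitDownload(latch, 30, TimeUnit.SECONDS);
        System.out.println(finished ? "downloaded" : "timed out waiting for the .jnlp response");
    }
}
```

In the version above, this would replace the bare latch.await(), so that webClient.close() is still reached when nothing arrives.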

So why is your Test2 not working? Because the response for the downloaded file has the Content-Type application/x-java-jnlp-file, so you need to use a WebStartHandler instead. If the response headers contained a 'Content-Disposition' header whose value starts with 'attachment', then the attachment handler approach in your Test2 might work.
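
To make the header rule concrete, here is a small sketch that models the decision described above as a plain function. It is a simplified illustration, not HtmlUnit's actual source: the ResponseRouting class, the Route enum, and the classify method are hypothetical names, and the case-insensitive matching and stripping of parameters such as "; charset=UTF-8" are assumptions made for the sketch.

```java
import java.util.Locale;

// Hypothetical model of the rule above; not taken from HtmlUnit.
final class ResponseRouting {

    enum Route { WEB_START_HANDLER, ATTACHMENT_HANDLER, NORMAL_PAGE }

    // Decides which kind of handler a response would need, based only on two headers.
    // Either argument may be null when the header is absent.
    static Route classify(String contentType, String contentDisposition) {
        if (contentType != null) {
            // drop parameters such as "; charset=UTF-8" before comparing (assumption)
            String mime = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
            if (mime.equals("application/x-java-jnlp-file")) {
                return Route.WEB_START_HANDLER;
            }
        }
        if (contentDisposition != null
                && contentDisposition.trim().toLowerCase(Locale.ROOT).startsWith("attachment")) {
            return Route.ATTACHMENT_HANDLER;
        }
        return Route.NORMAL_PAGE;
    }

    public static void main(String[] args) {
        // The .jnlp response in this question has the Java Web Start Content-Type.
        System.out.println(classify("application/x-java-jnlp-file", null));
        // A typical file download that an attachment handler would pick up.
        System.out.println(classify("application/pdf", "attachment; filename=\"report.pdf\""));
    }
}
```

Under this model the .jnlp response never counts as an attachment, which is consistent with the attachment handler in Test2 not collecting it.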

Upvotes: 1
