my-lord

Reputation: 2923

How to parse html from javafx webview and transfer this data to Jsoup Document?

I am trying to parse the sidebar TOC (table of contents) of a documentation site.

Jsoup

I have tried Jsoup, but I cannot get the TOC elements: the HTML content of this tag is not part of the initial HTML and is set by JavaScript after the page has loaded.

You can see my previous question here: JSoup cannot parse child elements after depth 2

The suggested solution was to manually examine, in the browser's Dev Tools, which connections the page makes in order to find the final version of the page. Parsing the sidebar TOC of a documentation site is just one component of my Java program, so I cannot do this manually.

JavaFX WebView (not Android WebView)

I have tried JavaFX WebView because I need a browser that executes the JavaScript code and fills in the TOC tag's components.

WebView browser = new WebView();
WebEngine webEngine = browser.getEngine();
webEngine.load("https://learn.microsoft.com/en-us/ef/ef6/");

But I don't know how to retrieve the HTML of the loaded page and transfer it to a Jsoup Document. Any advice is appreciated.

Upvotes: 0

Views: 1827

Answers (2)

Luk

Reputation: 2246

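    WebView browser = new WebView();
    WebEngine webEngine = browser.getEngine();
    String url = "https://learn.microsoft.com/en-us/ef/ef6/";
    // load() is asynchronous, so read the document only once loading has succeeded
    webEngine.getLoadWorker().stateProperty().addListener((obs, oldState, newState) -> {
        if (newState == javafx.concurrent.Worker.State.SUCCEEDED) {
            // get the W3C document from the WebEngine
            org.w3c.dom.Document w3cDocument = webEngine.getDocument();
            // use the jsoup helper to serialize it to a string
            String html = new org.jsoup.helper.W3CDom().asString(w3cDocument);
            // create a jsoup document by parsing the HTML, with the URL as the base URI
            Document doc = Jsoup.parse(html, url);
        }
    });
    webEngine.load(url);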

Upvotes: 3

Slaw

Reputation: 45806

I can't promise this is the best way as I've not used Jsoup before and I'm not an expert on the XML API.

The org.jsoup.Jsoup class has a method for parsing HTML in String form: Jsoup.parse(String). This means we need to get the HTML from the WebView as a String. The WebEngine class has a document property that holds an org.w3c.dom.Document. This Document is the HTML content of the web page currently being shown. We just need to convert this Document into a String, which we can do with a Transformer.

import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.jsoup.Jsoup;

public class Utils {

  private static Transformer transformer;

  // not thread safe
  public static org.jsoup.nodes.Document convert(org.w3c.dom.Document doc)
      throws TransformerException {
    if (transformer == null) {
      transformer = TransformerFactory.newDefaultInstance().newTransformer();
    }

    StringWriter writer = new StringWriter();
    transformer.transform(new DOMSource(doc), new StreamResult(writer));
    return Jsoup.parse(writer.toString());
  }

}

You would call this every time the document property changes. I did some "tests" by browsing Google and printing the org.jsoup.nodes.Document to the console and everything seems to be working.
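One detail of the Transformer step may be worth checking: unless it is configured otherwise, the Transformer may serialize the Document as XML (for example with a leading XML declaration) rather than as HTML. Jsoup may well cope with either form. The following is only a minimal sketch, using nothing but the JDK, of asking for HTML output explicitly. The class name DomToHtml and the methods serialize and sampleDocument are made up for illustration, and the sample document stands in for the one a WebEngine would provide.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper, not part of JavaFX or jsoup.
final class DomToHtml {

    private DomToHtml() {}

    // Serializes a W3C Document using the "html" output method instead of XML.
    static String serialize(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.METHOD, "html");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(writer));
            return writer.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not serialize document", e);
        }
    }

    // Builds a tiny stand-in document; a real one would come from WebEngine.getDocument().
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element html = doc.createElement("html");
            Element body = doc.createElement("body");
            Element paragraph = doc.createElement("p");
            paragraph.setTextContent("Hello");
            body.appendChild(paragraph);
            html.appendChild(body);
            doc.appendChild(html);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Could not create document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize(sampleDocument()));
    }
}
```

The resulting String can then be passed to Jsoup.parse(String) exactly as in the Utils class above.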

There is a caveat, though: as far as I understand it, the document property does not change when there are changes within the same web page (the Document itself may be updated, however). I'm not a web person, so pardon me if I don't make sense here, but I believe that this includes things like a frame changing its content. There may be a way around this by interfacing with the JavaScript using WebEngine.executeScript(String), but I don't know how.
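For what it's worth, one possible shape of the executeScript route is to ask the page for document.documentElement.outerHTML, a standard DOM expression that returns the current markup, including nodes added by scripts. The sketch below is an assumption-laden illustration, not a tested JavaFX program. The class LiveHtmlReader and the method currentHtml are hypothetical names, and the script executor is passed in as a Function so that the sketch compiles and runs without JavaFX. In a real application the executor would presumably be webEngine::executeScript, called on the JavaFX Application Thread after the load worker reports SUCCEEDED.

```java
import java.util.function.Function;

// Hypothetical helper: reads the rendered DOM as an HTML string through a JavaScript call.
final class LiveHtmlReader {

    // Standard DOM expression that serializes the page in its current state.
    static final String OUTER_HTML_SCRIPT = "document.documentElement.outerHTML";

    private LiveHtmlReader() {}

    // scriptExecutor runs JavaScript in the page; with JavaFX this could be webEngine::executeScript.
    static String currentHtml(Function<String, Object> scriptExecutor) {
        Object result = scriptExecutor.apply(OUTER_HTML_SCRIPT);
        if (!(result instanceof String)) {
            throw new IllegalStateException("Expected an HTML string but got: " + result);
        }
        return (String) result;
    }

    public static void main(String[] args) {
        // Stand-in executor so the sketch runs without a WebView.
        Function<String, Object> fakeEngine =
                script -> "<html><body><ul id=\"toc\"></ul></body></html>";
        String html = currentHtml(fakeEngine);
        System.out.println(html);
        // With jsoup on the classpath, the string could then be parsed with:
        // org.jsoup.nodes.Document doc = org.jsoup.Jsoup.parse(html, url);
    }
}
```

Content that scripts add after the load has finished may still be missing at that moment, so the call might need to be repeated or delayed until the TOC elements are actually present.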

Upvotes: 1
