JSoup CSS / DOM questions

Question

1. (From: https://www.virustotal.com/en/file/7b6b268cbca9d421aabba5f08533d3dcaba50e0f7887b07ef2bd66bf218b35ff/analysis/)

I want to get the text shown in the picture. In Chrome Developer Tools I would do it with the expression below. (I went one child node deeper, past the span, to find the MD5 in DevTools, but in jsoup it seems to work differently and only returns the "md5" text.)

document.getElementById("additional-info-content").childNodes[1].children[1].childNodes[1].innerHTML

I can't manage to get it using jsoup's DOM methods or selectors. (If possible, please give examples of both.)

2.

How do I specify a child in a CSS selector in jsoup? For example, I right-click the span with the class field above the blue-marked line in the picture and click "Copy Selector":

#file-details > div:nth-child(2) > div:nth-child(1) > span

It gives me file-details as the first div, even though it's not the only file-details in the document. But okay, let's say it should be like this(?):

#additional-info-content > div:file-details > div:nth-child(2) > div:nth-child(1) > span

How do I translate it into a working jsoup CSS query that selects the child? (If possible, a DOM example as well.)

3.

Is there a good approach to finding the right path when looking for a specific value/node?

What I do now is open Developer Tools, click on a unique div class name, check the properties window inside DevTools for the child nodes, and keep digging through the child nodes until I find the right path (like the one I copied in the first question).

Is there a better way to look at this?

I mean, using the DevTools console is so simple: I just write .children[1].childNodes[3].children[1] while looking at the properties until I see the attribute I need. But I guess that's not the right way?

Zack · Accepted Answer

1)

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    // connect to the URL and retrieve the page source as a Document
    Document doc = Jsoup
            .connect(url)
            .userAgent("Mozilla/5.0")
            .referrer("http://www.google.com")
            .get();

    String md5 = doc
            // use a CSS selector to grab only the enum rows that contain "md5"
            // (jsoup's :contains match is case-insensitive)
            .select("div#file-details.extra-info > div.enum-container > div.enum:contains(md5)")
            // use the first element in the result set (null if nothing matched)
            .first()
            // use only its own text node and ignore the text of the child span
            .ownText();
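Since the question also asks for a DOM-style version, here is a minimal sketch of the same lookup using traversal calls instead of a selector. The sample markup, the `field-key` class, the hash values, and the names `Md5DomExample` and `extractMd5` are assumptions made up for illustration; only the id and class names taken from the selector above come from the real page. Keep in mind that in the browser `childNodes` includes whitespace text nodes while `children` holds elements only; jsoup makes the same distinction between `childNode(int)` / `childNodes()` and `child(int)` / `children()`, which is why indexes copied from the DevTools console often do not line up.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class Md5DomExample {

    // Hypothetical markup, modelled only on the names used in the selector above.
    static final String SAMPLE_HTML =
            "<div id=\"additional-info-content\">"
          + "<div id=\"file-details\" class=\"extra-info\">"
          + "<h5>File identification</h5>"
          + "<div class=\"enum-container\">"
          + "<div class=\"enum\"><span class=\"field-key\">MD5</span> 0123456789abcdef0123456789abcdef</div>"
          + "<div class=\"enum\"><span class=\"field-key\">SHA1</span> 0123456789abcdef0123456789abcdef01234567</div>"
          + "</div></div></div>";

    /** Walks the tree with DOM-style calls; returns null if no MD5 row is found. */
    static String extractMd5(String html) {
        Document doc = Jsoup.parse(html);

        // getElementById returns the first element with that id, or null
        Element details = doc.getElementById("file-details");
        if (details == null) {
            return null;
        }

        // children() contains element children only, never text nodes
        for (Element container : details.children()) {
            if (!container.hasClass("enum-container")) {
                continue;
            }
            for (Element row : container.children()) {
                Element label = row.children().first(); // the span holding the label
                if (label != null && label.text().trim().equalsIgnoreCase("md5")) {
                    // ownText() skips the span's text and keeps the row's own text
                    return row.ownText().trim();
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(extractMd5(SAMPLE_HTML));
    }
}
```

Matching on class names and label text rather than on positions tends to survive small layout changes better than a chain of numeric indexes.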

2) There are lots of ways to specify children. You can use CSS selectors or some of the jsoup convenience methods.

If I want to extract the text foo from the following HTML:

    foo
    bar

Each of these will produce the same result:

    doc.select("div > span > b").last().ownText();

    doc.select("div > span > b").get(1).ownText();

    doc.select("div > span:last-child > b").text();

    doc.select("div > span:last-child").text();

    doc.select("div > span").last().text();

    doc.select("div > span").get(1).text();

    doc.select("div > span:last-child > b").first().ownText();

    doc.select("span > b").last().text();

Deciding which way to go really depends on the HTML structure of the document you are parsing. See CSS Selectors for more examples.
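To try the variants above, a small self-contained sketch can help. The markup in `HTML` is an assumption: it is one structure for which the selectors listed above all return foo, not necessarily the snippet this answer was written against. The class and method names are hypothetical too. The sketch also shows that `:nth-child(n)`, as produced by the browser's "Copy selector", works in jsoup (it is 1-based and counts elements only), and that the convenience method `child(int)` is 0-based.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ChildSelectorDemo {

    // Hypothetical markup: two spans in a div, the second one holding "foo".
    static final String HTML =
            "<div>"
          + "<span><b>bar</b></span>"
          + "<span><b>foo</b></span>"
          + "</div>";

    // Structural pseudo-class: the span that is the last element child of the div
    static String viaLastChild(Document doc) {
        return doc.select("div > span:last-child > b").text();
    }

    // :nth-child is 1-based, so 2 means "second element child"
    static String viaNthChild(Document doc) {
        return doc.select("div > span:nth-child(2) > b").text();
    }

    // Index into the result list instead; Elements.get is 0-based
    static String viaIndex(Document doc) {
        return doc.select("div > span > b").get(1).ownText();
    }

    // Pure DOM navigation: body -> div -> second span -> its b (child(int) is 0-based)
    static String viaDomCalls(Document doc) {
        return doc.body().child(0).child(1).child(0).text();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        System.out.println(viaLastChild(doc));
        System.out.println(viaNthChild(doc));
        System.out.println(viaIndex(doc));
        System.out.println(viaDomCalls(doc));
    }
}
```

With this markup each call yields foo. The index-based and `child(int)` variants are the most fragile, because an extra element inserted earlier in the page shifts every position.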

3) Examine the page source (what the server actually sends), not the DOM rendered in the browser. Jsoup does not execute JavaScript. If the DOM of your page is changed on load, then you need to render the page before parsing it. Here is an example of how to do this: https://stackoverflow.com/a/38572859/1176178
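A quick way to see the difference is to parse markup whose content would only be filled in by a script. The sketch below is a made-up example (the HTML, the `#result` id, and the names `StaticSourceCheck` and `hasTextInStaticSource` are all assumptions): the div is empty in the static source, so jsoup finds the element but no text, while a browser would show the injected value.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class StaticSourceCheck {

    // Hypothetical page: the div is populated by JavaScript after loading.
    static final String HTML =
            "<html><body>"
          + "<div id=\"result\"></div>"
          + "<script>document.getElementById('result').textContent = 'loaded later';</script>"
          + "</body></html>";

    /** True only if the selector matches an element that already has text in the static markup. */
    static boolean hasTextInStaticSource(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Element match = doc.select(cssQuery).first(); // null when nothing matches
        return match != null && match.hasText();
    }

    /** Returns the raw source of the first script element, which jsoup stores as data, not text. */
    static String firstScriptSource(String html) {
        Element script = Jsoup.parse(html).select("script").first();
        return script == null ? "" : script.data();
    }

    public static void main(String[] args) {
        // The element exists, but the script never ran, so it has no text
        System.out.println(hasTextInStaticSource(HTML, "#result"));
        // The script is still there as unexecuted source
        System.out.println(firstScriptSource(HTML));
    }
}
```

If a check like this comes back false for a value you can see in DevTools, the value is most likely added by JavaScript, and the page needs to be rendered first as described in the linked answer.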
