Shivansh Potdar

Reputation: 1267

get all links from a div with JSoup

I am using Jsoup to parse a site, and I want to get all the links from the following HTML:

<ul class="detail-main-list">
  <li> 
    <a href="/manga/toki_wa/v01/c001/1.html" title="Toki wa... Vol.01 Ch.001 -Toki wa..." target="_blank"> Dis Be the link</a>
   </li> 
</ul>

Any idea how?

Upvotes: 0

Views: 681

Answers (2)

Ashish Karn

Reputation: 1143

You can select every <a> element and print its href attribute like this. The same approach works for HTML from any website:

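import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class LinkExtractor {
    public static void main(String[] args) {
        // The markup from the question, wrapped in a minimal page.
        String htmlString = "<html>\n" +
                " <head></head>\n" +
                " <body>\n" +
                "<ul class=\"detail-main-list\">\n" +
                "  <li> \n" +
                "    <a href=\"/manga/toki_wa/v01/c001/1.html\" title=\"Toki wa... Vol.01 Ch.001 -Toki wa...\" target=\"_blank\"> Dis Be the link</a>\n" +
                "   </li> \n" +
                "</ul>\n" +
                " </body>\n" +
                "</html>";
        Document html = Jsoup.parse(htmlString);
        // Select every <a> element in the document.
        Elements elements = html.select("a");
        for (Element element : elements) {
            // Print the href value exactly as written in the markup.
            System.out.println(element.attr("href"));
        }
    }
}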

Output:

/manga/toki_wa/v01/c001/1.html

Upvotes: 1

rzwitserloot

Reputation: 102933

Straight from jsoup.org, right there, first thing you see:

Document doc = Jsoup.connect("https://en.wikipedia.org/").get();
log(doc.title());
Elements newsHeadlines = doc.select("#mp-itn b a");
for (Element headline : newsHeadlines) {
  log("%s\n\t%s", 
    headline.attr("title"), headline.absUrl("href"));
}

Modifying this to what you need seems trivial:

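// Replace this URL with the page that contains the list; Wikipedia is only carried over from the example above.
Document doc = Jsoup.connect("https://en.wikipedia.org/").get();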
Elements anchorTags = doc.select("ul.detail-main-list a");
for (Element anchorTag : anchorTags) {
  System.out.println("Links to: " + anchorTag.attr("href"));
  System.out.println("In absolute form: " + anchorTag.absUrl("href"));
  System.out.println("Text content: " + anchorTag.text());
}

The ul.detail-main-list a part is a so-called selector string. A real short tutorial on these:

  • foo means: Any HTML element with that tag name, e.g. <foo></foo>.
  • .bar means: Any HTML element with class bar, e.g. <foo class="bar baz"></foo>.
  • #bar means: Any HTML element with id bar, e.g. <foo id="bar"></foo>.
  • These can be combined: ul.detail-main-list matches any <ul> tag that has detail-main-list in its list of classes.
  • a b means: everything matching the 'b' selector that has something matching 'a' as an ancestor (not necessarily the direct parent). So ul a matches all <a> tags that have a <ul> tag around them someplace.
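
To see the rules above in action, here is a minimal sketch that counts matches for each kind of selector. The HTML string, the SelectorDemo class and the count helper are made up for illustration: the markup is the question's list plus one extra link outside it, so the difference between a and ul a shows up.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorDemo {
    // Hypothetical page: one link outside the list, one link inside it.
    static final String HTML =
        "<div id=\"nav\"><a href=\"/home\">Home</a></div>"
        + "<ul class=\"detail-main-list other\">"
        + "<li><a href=\"/manga/toki_wa/v01/c001/1.html\">Dis Be the link</a></li>"
        + "</ul>";

    // Returns how many elements in HTML match the given selector.
    static int count(String selector) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(selector).size();
    }

    public static void main(String[] args) {
        System.out.println(count("a"));                     // tag name: both links
        System.out.println(count(".detail-main-list"));     // class: the <ul>
        System.out.println(count("#nav"));                  // id: the <div>
        System.out.println(count("ul.detail-main-list"));   // tag and class combined
        System.out.println(count("ul a"));                  // only the link inside a <ul>
        System.out.println(count("ul.detail-main-list a")); // the selector from the answer
    }
}
```

Only the plain a selector picks up the navigation link; every selector scoped to the <ul> skips it, which is why ul.detail-main-list a is the safer choice on a real page.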

The JSoup docs are excellent.
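
One note on absUrl("href"): it resolves the link against the document's base URI, which Jsoup.connect(...).get() sets for you. If you parse a plain string with Jsoup.parse(html) there is no base URI and absUrl returns an empty string; passing one as a second argument, Jsoup.parse(html, baseUri), fixes that. The sketch below shows the idea of resolving with plain java.net.URI. It is not jsoup's own code, and example.com plus the second href are placeholders, since the question does not name the site.

```java
import java.net.URI;

public class AbsoluteLink {
    // Resolves an href against the URL of the page it was found on.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        // Placeholder base URL; substitute the page you actually fetched.
        String page = "https://example.com/manga/toki_wa/";
        // Root-relative href, as in the question.
        System.out.println(resolve(page, "/manga/toki_wa/v01/c001/1.html"));
        // Hypothetical page-relative href, resolved against the page's directory.
        System.out.println(resolve(page, "v01/c001/2.html"));
    }
}
```

Both calls end up under https://example.com/manga/toki_wa/v01/c001/, the first because the href starts at the site root and the second because it is appended to the page's directory.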

Upvotes: 2
