James H

Reputation: 580

Scrape the links on a webpage with Jsoup

I'm using Jsoup to scrape a webpage. Can anybody help me out or point me in the right direction for how to extract the links contained in the HTML below? At present I'm running a for-each loop; it iterates through the elements but doesn't find the link, and it stops after one iteration.

The HTML:

<div>
  <div style = a bunch of different inline styles here>
    <div class = "_6d3hm _mnav9">
      <div class = "_mck9w _gvoze _tn0ps">
        <a href= "the link i want">_</a>
      </div>
      <div class = "_mck9w _gvoze _tn0ps">
        <a href= "another link i want">_</a>
      </div>
      <div class = "_mck9w _gvoze _tn0ps">
        <a href= "another link i want">_</a>
      </div>
    </div>
    <div class = "_6d3hm _mnav9">
      <div class = "_mck9w _gvoze _tn0ps">
        <a href= "the link i want">_</a>
      </div>
      <div class = "_mck9w _gvoze _tn0ps">
        <a href= "another link i want">_</a>
      </div>
      <div class = "_mck9w _gvoze _tn0ps">
        <a href= "another link i want">_</a>
      </div>
    </div>
  </div>
</div>

This is my Java using Jsoup. I've experimented with a bunch of different tags:

for (Element row : doc.select("div")) {
    System.out.println("iterating");
    final String link = row.getElementsByTag("._mck9w _gvoze _tn0ps").text();
    System.out.println(link);
}

Does anybody have an idea how I can scrape every link I've mentioned in the HTML?

Upvotes: 0

Views: 107

Answers (1)

Luk

Reputation: 2246

The error is in this line: row.getElementsByTag("._mck9w _gvoze _tn0ps"). getElementsByTag expects a tag name, not a list of class names. You are looking for a tags and their href attribute, so your code should look like this:

for (Element row : doc.select("div")) {
    System.out.println("iterating");
    final String link = row.getElementsByTag("a").attr("href");
    System.out.println(link);
}
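
One thing to watch with a loop like this: as far as I know, calling attr on an Elements collection reads the value from the first matching element only, and doc.select("div") also matches the outer wrapper divs, so the same first link can show up more than once. A minimal sketch that visits each a element instead; the class name FirstMatchDemo, the helper allHrefs, and the sample HTML are made up for illustration:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class FirstMatchDemo {
    // Hypothetical helper: collects the href of every <a> element in the HTML.
    static List<String> allHrefs(String html) {
        Document doc = Jsoup.parse(html);
        List<String> hrefs = new ArrayList<>();
        for (Element a : doc.getElementsByTag("a")) {
            hrefs.add(a.attr("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Sample markup (an assumption, not the real page)
        String html = "<div><div><a href=\"/one\">_</a></div>"
                + "<div><a href=\"/two\">_</a></div></div>";
        Document doc = Jsoup.parse(html);
        // attr() on the whole collection reads from the first match only
        System.out.println(doc.getElementsByTag("a").attr("href"));
        // Iterating the elements gives every link
        System.out.println(allHrefs(html));
    }
}
```

Iterating over the a elements themselves, rather than over every div, avoids both the repeated output and the missed links.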

If you want to make use of the fact that the div has a class attribute with the given values, you can try something like this:

for (Element e : doc.select("div._mck9w._gvoze._tn0ps > a")) {
    System.out.println(e.attr("href"));
}
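
To try the class-based selector without fetching the real page, here is a small sketch that parses an inline HTML string. In a CSS selector a space means "descendant", so the three class names have to be chained with dots to mean "a div carrying all three classes". The class name ClassSelectorDemo, the helper linksInTiles, and the sample hrefs are assumptions for illustration only:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class ClassSelectorDemo {
    // Hypothetical helper: returns the href of each <a> that is a direct child
    // of a div carrying all three classes from the question.
    static List<String> linksInTiles(String html) {
        Document doc = Jsoup.parse(html);
        List<String> links = new ArrayList<>();
        for (Element a : doc.select("div._mck9w._gvoze._tn0ps > a")) {
            links.add(a.attr("href"));
        }
        return links;
    }

    public static void main(String[] args) {
        // Sample markup shaped like the question's HTML (hrefs are invented)
        String html =
              "<div class=\"_6d3hm _mnav9\">"
            + "  <div class=\"_mck9w _gvoze _tn0ps\"><a href=\"/p/first\">_</a></div>"
            + "  <div class=\"_mck9w _gvoze _tn0ps\"><a href=\"/p/second\">_</a></div>"
            + "  <div class=\"other\"><a href=\"/ignored\">_</a></div>"
            + "</div>";
        for (String link : linksInTiles(html)) {
            System.out.println(link);
        }
    }
}
```

The link inside the div with class "other" is skipped because that div does not carry the three classes.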

jsoup documentation

More about CSS selectors

Upvotes: 2
