Using JSoup CSS selectors

Question

I am trying to use JSoup to scrape some content off of a website. Here is some sample HTML content from the page I am interested in:


I want to obtain a list of all the hotwords in the page (so "Fizz", "Buzz", "Foo" and "Bar"). But I can't just query for hotword, because the site uses the hotword class all over the place to decorate lots of different elements. Specifically, I need only the hotwords that sit inside a pg element that is itself inside a pbk element. Note that pbks can contain 0+ pgs, pgs can contain 0+ hotwords, and hotwords can contain 1+ other hotwords. I have the following code:
// Update, per PShemo:
Document doc = Jsoup.connect("http://somesite.example.com").get();

System.out.println("Starting to crawl...");

// Get the document's .pbk elements.
Elements pbks = doc.select(".pbk");

List hotwords = new ArrayList();

System.out.println(String.format("Found %s pbks.", pbks.size()));
int pbkCount = 0;
for(Element pbk : pbks) {
    pbkCount++;

    // Get the .pbk element's .pg elements.
    for(Element pg : pbk.getElementsByClass("pg")) {
        System.out.println(String.format("PBK #%s has %s pgs.", pbkCount, pbk.getElementsByClass("pg").size()));
        Element hotword = pg.getElementById("hotword");

        System.out.println("Adding hotword: " + hotword.text());
        hotwords.add(hotword.text());
    }
}

Running that code produces the following output:
Starting to crawl...
Found 3 pbks.

I am either not using the JSoup API correctly, or not using the right selectors, or both. Any thoughts as to where I'm going awry?

Pshemo · Accepted Answer

If you are using getElementsByClass, you don't need to put a . before the class name. Pass the bare class name, as in getElementsByClass("pg"), not getElementsByClass(".pg").

The same goes for getElementById: don't add # before the id value. Just use getElementById("hotword").
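To make the difference concrete, here is a minimal sketch. The markup in SAMPLE_HTML and the class name LookupSyntaxSketch are assumptions for illustration, not the real page. It compares the bare class name, the dotted class name, and the CSS selector form. It also checks getElementById("hotword"): since the question describes hotword as a class rather than an id, that call returns null on markup like this.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class LookupSyntaxSketch {
    // Hypothetical markup, loosely following the class names in the question.
    static final String SAMPLE_HTML =
        "<div class=\"pbk\"><span class=\"pg\">"
        + "<span class=\"hotword\">Fizz</span>"
        + "</span></div>";

    // Bare class name: the form getElementsByClass expects.
    static int byBareClassName(Document doc) {
        return doc.getElementsByClass("pg").size();
    }

    // Leading dot: searches for a class literally named ".pg".
    static int byDottedClassName(Document doc) {
        return doc.getElementsByClass(".pg").size();
    }

    // CSS selector syntax, where the dot does belong.
    static int byCssSelector(Document doc) {
        return doc.select(".pg").size();
    }

    // True when no element has id="hotword" (getElementById returns null).
    static boolean hotwordIdMissing(Document doc) {
        return doc.getElementById("hotword") == null;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(SAMPLE_HTML);
        System.out.println("getElementsByClass(\"pg\"):  " + byBareClassName(doc));
        System.out.println("getElementsByClass(\".pg\"): " + byDottedClassName(doc));
        System.out.println("select(\".pg\"):             " + byCssSelector(doc));
        System.out.println("id \"hotword\" missing:      " + hotwordIdMissing(doc));
    }
}
```

If hotword really is a class on your page, pg.getElementsByClass("hotword") or pg.select(".hotword") is probably what you want in the inner loop.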

Also, your divs with the pbk class appear to be nested, so calling getElementsByClass on each of them could give you duplicate results.
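A small sketch of how the duplicates arise, assuming a hypothetical page where one .pbk div wraps another (the markup and the class name NestedPbkSketch are made up for illustration). Looping over every .pbk and asking each one for its .pg descendants visits the inner pg twice, while a single descendant selector returns it once.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class NestedPbkSketch {
    // Hypothetical markup: an outer .pbk wrapping an inner .pbk with one .pg.
    static final String SAMPLE_HTML =
        "<div class=\"pbk\">"
        + "<div class=\"pbk\"><span class=\"pg\">inner pg</span></div>"
        + "</div>";

    // Mirrors the loop in the question: count every .pg found under every .pbk.
    static int countPgVisits(Document doc) {
        int visits = 0;
        for (Element pbk : doc.select(".pbk")) {
            // getElementsByClass searches all descendants, not just children.
            visits += pbk.getElementsByClass("pg").size();
        }
        return visits;
    }

    // One descendant selector: each matching element appears once.
    static int countDistinctPgs(Document doc) {
        return doc.select(".pbk .pg").size();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(SAMPLE_HTML);
        System.out.println("pg visits via nested loops: " + countPgVisits(doc));
        System.out.println("pg matches via one selector: " + countDistinctPgs(doc));
    }
}
```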

Now that I know which page you are trying to parse, you can do it with one selector. Try something like this:

for (Element element : doc.select("div.body div.pbk span.pg")) {
    System.out.println(element.text());
}
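If you want the hotwords themselves collected into a list, the same single-selector idea can be sketched as below. This assumes the pbk > pg > hotword class nesting described in the question; the sample markup, the selector ".pbk .pg .hotword", and the class name HotwordSelectorSketch are illustrative guesses, so adjust them to the real page.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HotwordSelectorSketch {
    // Hypothetical markup; the trailing <p> is a decorative hotword outside any pbk/pg.
    static final String SAMPLE_HTML =
        "<div class=\"pbk\"><span class=\"pg\">"
        + "<span class=\"hotword\">Fizz</span> and <span class=\"hotword\">Buzz</span>"
        + "</span></div>"
        + "<div class=\"pbk\">"
        + "<span class=\"pg\"><span class=\"hotword\">Foo</span></span>"
        + "<span class=\"pg\"><span class=\"hotword\">Bar</span></span>"
        + "</div>"
        + "<p class=\"hotword\">decoration</p>";

    // Returns the text of every .hotword that sits inside a .pg inside a .pbk.
    static List<String> extractHotwords(String html) {
        Document doc = Jsoup.parse(html);
        // Descendant combinator (space): matches at any depth, in document order.
        return doc.select(".pbk .pg .hotword").eachText();
    }

    public static void main(String[] args) {
        for (String hotword : extractHotwords(SAMPLE_HTML)) {
            System.out.println(hotword);
        }
    }
}
```

Note that text() on an element includes the text of its children, so if hotwords nest inside other hotwords, the outer element's entry will contain the inner word as well.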
