Reputation: 56904
I am trying to use JSoup to scrape some content off of a website. Here is some sample HTML content from the page I am interested in:
<div class="sep_top shd_hdr pb2 luna">
<div class="KonaBody" style="padding-left:0px;">
<div class="lunatext results_content frstluna">
<div class="luna-Ent">
<div class="header">
<div class="body">
<div class="pbk">
<div id="rltqns">
<div class="pbk">
<span class="pg">
<span id="hotword">
<span id="hotword">Fizz</span>
</span>
</span>
<div class="luna-Ent">
<div class="luna-Ent">
<div class="luna-Ent">
<div class="luna-Ent">
</div>
<div class="pbk">
<span class="sectionLabel">
<span class="pg">
<span id="hotword">
<span id="hotword">Buzz</span>
</span>
</span>
<span class="pg">
<span id="hotword">
<span id="hotword">Foo</span>
</span>
</span>
<span class="pg">
<span id="hotword">
<span id="hotword">Bar</span>
</span>
</span>
</div>
<div class="tail">
</div>
<div class="rcr">
<!-- ... rest of content omitted for brevity -->
I am interested in obtaining a list of all the hotwords in the page (so "Fizz", "Buzz", "Foo" and "Bar"). But I can't just query for hotword, because they use the hotword class all over the place to decorate lots of different elements. Specifically, I need all the hotwords that exist inside a pbk pg hotword element. Note that pbks can contain 0+ pgs, pgs can contain 0+ hotwords, and hotwords can contain 1+ other hotwords. I have the following code:
// Update, per PShemo:
Document doc = Jsoup.connect("http://somesite.example.com").get();
System.out.println("Starting to crawl...");
// Get the document's .pbk elements.
Elements pbks = doc.select(".pbk");
List<String> hotwords = new ArrayList<String>();
System.out.println(String.format("Found %s pbks.", pbks.size()));
int pbkCount = 0;
for (Element pbk : pbks) {
    pbkCount++;
    // Get the .pbk element's .pg elements.
    for (Element pg : pbk.getElementsByClass("pg")) {
        System.out.println(String.format("PBK #%s has %s pgs.", pbkCount, pbk.getElementsByClass("pg").size()));
        Element hotword = pg.getElementById("hotword");
        System.out.println("Adding hotword: " + hotword.text());
        hotwords.add(hotword.text());
    }
}
Running that code produces the following output:
Starting to crawl...
Found 3 pbks.
I am either not using the JSoup API correctly, or not using the right selectors, or both. Any thoughts as to where I'm going awry?
Upvotes: 0
Views: 5237
Reputation: 9813
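// The sample markup uses id="hotword" (singular), so the selector is "#hotword".
Elements hotwords = document.select("#hotword");
for (Element hotword : hotwords) {
    // Jsoup's Element exposes text(), not getText().
    String word = hotword.text();
}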
Upvotes: 0
Reputation: 124225
If you are using getElementsByClass, you don't need to add . before the name; just use the class name, as in getElementsByClass("pg"), not getElementsByClass(".pg"). The same goes for getElementById: don't add # before the id value. Just use getElementById("hotword").
Also, it seems that your divs with the pbk class are nested, so getElementsByClass could give you duplicate results.
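If the nested loop does end up adding the same word more than once, one possible workaround is to pass the collected list through a LinkedHashSet, which drops repeats while keeping first-seen order. This is only a sketch: the class name HotwordDedup, the dedupe helper, and the sample list are made up for illustration, and it removes repeats by text, so a word that legitimately appears twice on the page would also be collapsed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

public class HotwordDedup {

    // Hypothetical helper: removes repeated words, keeping the first occurrence of each.
    static List<String> dedupe(List<String> words) {
        return new ArrayList<>(new LinkedHashSet<>(words));
    }

    public static void main(String[] args) {
        // Made-up input imitating words collected twice from nested .pbk elements.
        List<String> collected = Arrays.asList("Fizz", "Buzz", "Foo", "Buzz", "Foo", "Bar");
        System.out.println(dedupe(collected)); // prints [Fizz, Buzz, Foo, Bar]
    }
}
```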
Now that I know which page you are trying to parse, you can do it with one selector. Try something like this:
for (Element element : doc.select("div.body div.pbk span.pg")) {
    System.out.println(element.text());
}
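For anyone who wants to try the single-selector approach without hitting the live site, here is a minimal, self-contained sketch that parses an inline string instead. The class name HotwordExtractor, the extractHotwords helper, and the simplified markup are assumptions for illustration (the real page is larger and nests differently), and it needs the Jsoup jar on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class HotwordExtractor {

    // Hypothetical helper: collects the text of every span.pg found under a div.pbk.
    static List<String> extractHotwords(String html) {
        Document doc = Jsoup.parse(html);
        List<String> hotwords = new ArrayList<>();
        // select() lists each matching element once, even if .pbk divs are nested.
        for (Element pg : doc.select("div.pbk span.pg")) {
            hotwords.add(pg.text());
        }
        return hotwords;
    }

    public static void main(String[] args) {
        // Simplified, well-formed stand-in for the markup in the question.
        String html =
              "<div class=\"body\">"
            + "<div class=\"pbk\">"
            + "<span class=\"pg\"><span id=\"hotword\"><span id=\"hotword\">Fizz</span></span></span>"
            + "</div>"
            + "<div class=\"pbk\">"
            + "<span class=\"pg\"><span id=\"hotword\"><span id=\"hotword\">Buzz</span></span></span>"
            + "<span class=\"pg\"><span id=\"hotword\"><span id=\"hotword\">Foo</span></span></span>"
            + "</div>"
            + "</div>";
        System.out.println(extractHotwords(html)); // prints [Fizz, Buzz, Foo]
    }
}
```

Using pg.text() rather than looking up the inner hotword span avoids a null check when a pg has no hotword child; in that case it simply returns the pg's own text.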
Upvotes: 2