Reputation: 136
I am trying to parse http://www.craigslist.org/about/sites to build a set of text/links to load a program dynamically with this information. So far I have done this:
Document doc = Jsoup.connect("http://www.craigslist.org/about/sites").get();
Elements elms = doc.select("div.colmask"); // gets 7 countries
Below this tag are the tags I am trying to get, which I select with doc.select("div.state_delimiter,ul"). I set up my iterator, go into a while loop, and call iterator.next().outerHtml(); I see all the tags for each country.
How can I step through each div.state_delimiter, pull its text, and then go down until there is a </ul>, which marks the end of that state's individual county/city links and text?
I was playing around with this and can do it by assigning outerHtml() to a String and then parsing the string manually, but I am sure there is an easier way. I have tried text() and also attr("div.state_delimiter"), but I think I am messing up the pattern/routine needed to do this properly. Could someone show me how to get the div.state_delimiter text into a text field, followed by all the <li></li> items under the <ul></ul> for each state? I am looking to grab the http:// link and the HTML that goes along with it as easily as possible.
Upvotes: 4
Views: 4268
Reputation: 1108732
The <ul> containing the cities is the next sibling of the <div class="state_delimiter">. You can use Element#nextElementSibling() to grab it from that div. Here's a kickoff example:
Document document = Jsoup.connect("http://www.craigslist.org/about/sites").get();
Elements countries = document.select("div.colmask");
for (Element country : countries) {
    System.out.println("Country: " + country.select("h1.continent_header").text());
    Elements states = country.select("div.state_delimiter");
    for (Element state : states) {
        System.out.println("\tState: " + state.text());
        Elements cities = state.nextElementSibling().select("li");
        for (Element city : cities) {
            System.out.println("\t\tCity: " + city.text());
        }
    }
}
The doc.select("div.state_delimiter,ul") call doesn't do what you want: it returns all <div class="state_delimiter"> and <ul> elements in the document. Manually parsing the HTML with string functions makes no sense when you already have an HTML parser at hand.
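Since the question also asks for the links themselves, here is a minimal sketch of the same sibling approach that collects the absolute URLs per state. It is only an illustration under some assumptions: the class name StateLinks, the method name linksByState, the inline HTML sample, and the example.org base URI are all made up for the demo, and the real page markup may differ from what is shown here. It also guards against a state div that has no following <ul>, in which case nextElementSibling() could return null or some other element.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class StateLinks {

    // Hypothetical helper: maps each state name to the absolute URLs
    // of the links found in the <ul> that directly follows its div.
    static Map<String, List<String>> linksByState(Document document) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Element state : document.select("div.state_delimiter")) {
            List<String> urls = new ArrayList<>();
            Element list = state.nextElementSibling();
            // Only read the sibling if it exists and really is a <ul>.
            if (list != null && "ul".equals(list.tagName())) {
                for (Element link : list.select("li > a[href]")) {
                    // absUrl resolves relative hrefs against the document's base URI.
                    urls.add(link.absUrl("href"));
                }
            }
            result.put(state.text(), urls);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample markup standing in for the real page.
        String html = "<div class=\"colmask\">"
                + "<div class=\"state_delimiter\">Alabama</div>"
                + "<ul><li><a href=\"/auburn\">auburn</a></li>"
                + "<li><a href=\"http://bham.example.org/\">birmingham</a></li></ul>"
                + "<div class=\"state_delimiter\">Alaska</div>"
                + "<ul><li><a href=\"/anchorage\">anchorage</a></li></ul>"
                + "</div>";
        Document document = Jsoup.parse(html, "http://example.org/");
        linksByState(document)
                .forEach((state, urls) -> System.out.println(state + " -> " + urls));
    }
}
```

If you need the link text as well, link.text() gives it, and link.outerHtml() gives the full <a> markup. When the document comes from Jsoup.connect(...).get(), the base URI is the requested URL, so absUrl("href") should resolve relative links against it.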
Upvotes: 7