How do I parse an HTML document with JSoup to get a list of links?

Question

I am trying to parse http://www.craigslist.org/about/sites to build a set of text/link pairs that my program can load dynamically. So far I have done this:

Document doc = Jsoup.connect("http://www.craigslist.org/about/sites").get();
Elements elms = doc.select("div.colmask"); // gets 7 countries

Below this tag are the doc.select("div.state_delimiter,ul") elements I am trying to get. I set up an iterator, go into a while loop, and call iterator.next().outerHtml();. That shows me all the tags for each country.

How can I step through each div.state_delimiter, pull its text, and then go down until the closing tag that marks the end of that state's individual county/city links and text?

I was playing around with this and can do it by assigning outerHtml() to a String and then parsing the string manually, but I am sure there is an easier way. I have tried text() and also attr("div.state_delimiter"), but I think I am messing up the pattern/routine needed to do this properly. I was wondering if someone could help me out here and show me how to get the div.state_delimiter text into a text field and then the list that follows it.

I want all the entries under the ul for each state. I am looking to grab the http:// link and the HTML text that goes along with it as easily as possible.

BalusC · Accepted Answer

The <ul> is the next sibling of the <div class="state_delimiter">. You can use Element#nextElementSibling() to grab it from that div on. Here's a kickoff example:

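import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Place the following inside a method that declares `throws IOException`.
Document document = Jsoup.connect("http://www.craigslist.org/about/sites").get();
Elements countries = document.select("div.colmask");

for (Element country : countries) {
    System.out.println("Country: " + country.select("h1.continent_header").text());
    Elements states = country.select("div.state_delimiter");

    for (Element state : states) {
        System.out.println("\tState: " + state.text());
        // The element right after the state's div holds that state's cities.
        Elements cities = state.nextElementSibling().select("li");

        for (Element city : cities) {
            System.out.println("\t\tCity: " + city.text());
        }
    }
}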
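The loop above prints only the city names. Since the question also asks for the links themselves, here is a minimal sketch of collecting each city's link text together with its URL. It runs against a made-up inline fragment that assumes the structure described above (a div.state_delimiter followed by a list of links); the class name StateLinks, the method extract, and the example.org URLs are hypothetical. Jsoup.parse is given a base URI so that absUrl("href") can resolve relative links.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class StateLinks {

    // Hypothetical fragment that mimics the assumed page structure.
    static final String SAMPLE_HTML =
          "<div class=\"colmask\">"
        + "<div class=\"state_delimiter\">Alabama</div>"
        + "<ul><li><a href=\"http://auburn.example.org/\">auburn</a></li>"
        + "<li><a href=\"/bhm\">birmingham</a></li></ul>"
        + "<div class=\"state_delimiter\">Alaska</div>"
        + "<ul><li><a href=\"http://anchorage.example.org/\">anchorage</a></li></ul>"
        + "</div>";

    // Maps each state name to "text -> absolute URL" strings for its links.
    static Map<String, List<String>> extract(String html, String baseUri) {
        Document document = Jsoup.parse(html, baseUri);
        Map<String, List<String>> linksByState = new LinkedHashMap<>();

        for (Element state : document.select("div.state_delimiter")) {
            List<String> links = new ArrayList<>();
            Element list = state.nextElementSibling();
            // Guard against a state div that has no following element.
            if (list != null) {
                for (Element link : list.select("a[href]")) {
                    links.add(link.text() + " -> " + link.absUrl("href"));
                }
            }
            linksByState.put(state.text(), links);
        }
        return linksByState;
    }

    public static void main(String[] args) {
        extract(SAMPLE_HTML, "http://www.example.org/")
            .forEach((state, links) -> System.out.println(state + ": " + links));
    }
}
```

When the document comes from Jsoup.connect(...).get(), the base URI is already set from the request URL, so absUrl("href") should work there without passing one explicitly.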

The doc.select("div.state_delimiter,ul") doesn't do what you want. It returns all <div class="state_delimiter"> and <ul> elements in the document.
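To see why the grouped selector is not enough on its own, here is a small sketch run against a made-up fragment (the HTML and the class name SelectorGroupDemo are assumptions for illustration). A comma in a jsoup selector forms a group: the result holds every element that matches either side, as one flat list, with nothing tying a state to its list.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

class SelectorGroupDemo {

    // Hypothetical fragment: two states, each followed by a list of cities.
    static final String SAMPLE_HTML =
          "<div class=\"state_delimiter\">Alabama</div>"
        + "<ul><li>auburn</li><li>birmingham</li></ul>"
        + "<div class=\"state_delimiter\">Alaska</div>"
        + "<ul><li>anchorage</li></ul>";

    // Returns the tag name of every element matched by the grouped selector.
    static List<String> matchedTags(String html) {
        Document document = Jsoup.parse(html);
        Elements matches = document.select("div.state_delimiter, ul");
        List<String> tags = new ArrayList<>();
        for (Element match : matches) {
            tags.add(match.tagName());
        }
        return tags;
    }

    public static void main(String[] args) {
        // Divs and lists come back together in a single flat result.
        System.out.println(matchedTags(SAMPLE_HTML));
    }
}
```

Selecting only div.state_delimiter and calling nextElementSibling() on each match keeps every state paired with its own list.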
