No_name

Reputation: 2810

Jsoup get html between two tags

On a site like http://wikitravel.org/en/San_Francisco, sections like "Districts", "Understand", "Get in", etc. are not wrapped in a container element in the HTML. Each section title is just a span with a class and an id inside a heading, and the section's content follows that heading as siblings. Because of this, one cannot grab a section of a wiki document simply by selecting the id.

However, is it possible to collect all the HTML between two tags? Say I wanted the "Get around" section. How would I write a selector for all the HTML between

<h2><span class="editsection">[<a href="/wiki/en/index.php?title=San_Francisco&amp;action=edit&amp;section=15" title="Edit section: Get around">edit</a>]</span> <span class="mw-headline" id="Get_around">Get around</span></h2>

and

<h2><span class="editsection">[<a href="/wiki/en/index.php?title=San_Francisco&amp;action=edit&amp;section=22" title="Edit section: See">edit</a>][<a href="#See" title="click to add a see listing" onclick="addListing(this, '22', 'see', 'San_Francisco');">add listing</a>]</span> <span class="mw-headline" id="See">See</span></h2>

?

Upvotes: 3

Views: 3714

Answers (1)

Zach Shipley

Reputation: 1092

Ouch. That HTML is not very easy to work with. I take it you're probably doing some scraping, so I understand that sometimes this is the lot we're dealt. You tagged this , so I'll take a stab at it. There's normally no single selector for fairly unstructured HTML like this. What you can do is select all next siblings of the first h2 and then remove all next siblings of the second h2. To add to the pain, the h2 elements themselves carry no id, so here we identify the section headers by their text content with a :contains selector. Like this:

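import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

Document doc = Jsoup.connect("http://wikitravel.org/en/San_Francisco").get();
// select all "next siblings" of the "Get around" h2
Elements section = doc.select("h2:contains(Get around) ~ *");
// select all "next siblings" of the "See" h2 and drop them from the list
// (Elements.remove() would detach nodes from the document without shrinking this list)
section.removeAll(doc.select("h2:contains(See) ~ *"));
// drop the second h2 (the "See" heading), the only h2 left in the list
section.removeAll(doc.select("h2"));
// section now contains the elements between "Get around" and "See"
// outerHtml() keeps each element's own tag; html() would return only the inner HTML
String sectionHtml = section.outerHtml();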

Here's some Firebug output after doing the same thing with jQuery. The first selector returned an Elements object containing these Element objects:

[h3, p, p, p, p, h3, p, p, p, h3, div.thumb, div.thumb, p, ul, p, p, p, p, p, p, p, div.thumb, ul, ul, div.thumb, ul, ul, p, ul, ul, h3, p, p, p, h3, p, p, p, h3, p, p, p, p, p, p, h2, p, p, ul, ul, ul, h3, p, ul, h3, div.thumb, p, p, p, h3, div.thumb, p, p, p, p, p, h3, p, p, p, p, h3, div.thumb, p, p, p, p, p, h2, h3, div.thumb, p, p, p, p, p, ul, h3, div.thumb, ul, ul, ul, ul, ul, h3, p, h4, ul, h4, ul, h4, p, ul, h4, ul, h3, div.thumb, p, p, p, h3, p, h2, p, p, h2, p, p, p, h2, dl, p, p, p, p, h2, div.thumb, dl, p, p, p, h2, dl, h3, p, p, p, p, p, p, h3, p, ul, p, p, h2, dl, p, p, p, h2, p, p, p, p, h2, p, p, p, p, p, p, h2, p, p, p, p, h2, h3, ul, h3, ul, ul, ul, ul, ul, ul, ul, ul, ul, ul, ul, ul, ul, ul, ul, ul, ul, ul, ul, ul, ul, ul, ul, ul, ul, h2, p, p, ul, p, div.route_box, p, p, p, p, p, table, p, div, p, p, p, p]

Where the first h3 says "Navigating" and the last p contains a <br> (weird HTML, yeah). The second select and remove pared it down to this:

[h3, p, p, p, p, h3, p, p, p, h3, div.thumb, div.thumb, p, ul, p, p, p, p, p, p, p, div.thumb, ul, ul, div.thumb, ul, ul, p, ul, ul, h3, p, p, p, h3, p, p, p, h3, p, p, p, p, p, p, h2]

Where the first h3 is still the one saying "Navigating" and the last h2 is the "See" one you referenced. The select("h2") and remove resulted in this:

[h3, p, p, p, p, h3, p, p, p, h3, div.thumb, div.thumb, p, ul, p, p, p, p, p, p, p, div.thumb, ul, ul, div.thumb, ul, ul, p, ul, ul, h3, p, p, p, h3, p, p, p, h3, p, p, p, p, p, p]

Which contains all of the elements between the "Get around" h2 and the "See" h2.
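
The markup quoted in the question shows that each headline span carries its own id (Get_around, See), so one possible alternative is to find the heading with :has and then walk its following siblings until the next h2. The sketch below assumes that structure holds across the page. SectionExtractor, sectionAfter, and the inline sample markup are made-up names and data for illustration; they are not part of the original answer or of Jsoup itself.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SectionExtractor {

    /**
     * Collects the element siblings that follow the h2 whose headline span
     * has the given id, stopping at the next h2. Returns an empty list when
     * no such heading exists. Assumes the id is a plain token that is safe
     * to splice into a CSS selector.
     */
    static Elements sectionAfter(Document doc, String headlineId) {
        Elements section = new Elements();
        // h2:has(span#id) matches the h2 that wraps the headline span
        Element heading = doc.select("h2:has(span#" + headlineId + ")").first();
        if (heading == null) {
            return section;
        }
        // walk forward through the element siblings until the next h2
        for (Element el = heading.nextElementSibling();
             el != null && !el.tagName().equals("h2");
             el = el.nextElementSibling()) {
            section.add(el);
        }
        return section;
    }

    public static void main(String[] args) {
        // tiny stand-in for the real page; this markup is invented
        String html = "<h2><span class=\"mw-headline\" id=\"Get_around\">Get around</span></h2>"
                + "<h3>Navigating</h3><p>On foot.</p>"
                + "<h2><span class=\"mw-headline\" id=\"See\">See</span></h2>"
                + "<p>Sights.</p>";
        Document doc = Jsoup.parse(html);
        // outerHtml() keeps each element's own tag
        System.out.println(sectionAfter(doc, "Get_around").outerHtml());
    }
}
```

Because the loop stops at any h2, the caller does not need to know which section comes next. The document itself is left unmodified.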

Upvotes: 3
