Reputation: 2810
On a site like http://wikitravel.org/en/San_Francisco, sections like "Districts", "Understand", "Get in", etc. aren't wrapped in their own container in the HTML. Each section is marked only by a heading containing a span that carries the section's id. Because of this, one cannot grab a section of a wiki document simply by selecting that id.
However, is it possible to collect all the HTML between two tags? Say I wanted the "Get around" section. How would I write a selector that matches all the HTML between
<h2><span class="editsection">[<a href="/wiki/en/index.php?title=San_Francisco&action=edit&section=15" title="Edit section: Get around">edit</a>]</span> <span class="mw-headline" id="Get_around">Get around</span></h2>
and
<h2><span class="editsection">[<a href="/wiki/en/index.php?title=San_Francisco&action=edit&section=22" title="Edit section: See">edit</a>][<a href="#See" title="click to add a see listing" onclick="addListing(this, '22', 'see', 'San_Francisco');">add listing</a>]</span> <span class="mw-headline" id="See">See</span></h2>
?
Upvotes: 3
Views: 3714
Reputation: 1092
Ouch. That HTML is not very easy to work with. I take it you're probably doing some scraping, so I understand that sometimes this is the lot we're dealt. You tagged this jsoup, so I'll take a stab at it. There's no single selector for "everything between two elements" in flat HTML like this. What you can do is select all next siblings of the first h2 and then drop all next siblings of the second h2 from that list. To add to the pain, the ids sit on the inner span rather than on the h2 itself, so I'll identify the section headers by their text content with a :contains selector. Like this:
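import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

Document doc = Jsoup.connect("http://wikitravel.org/en/San_Francisco").get();
//select all "next siblings" of the "Get around" h2
Elements section = doc.select("h2:contains(Get around) ~ *");
//drop all "next siblings" of the "See" h2 from that list
//(Elements.remove() only detaches nodes from the document, so filter the list itself)
section.removeAll(doc.select("h2:contains(See) ~ *"));
//drop the "See" h2, the only h2 left in the list
section = section.not("h2");
//section now contains the elements between "Get around" and "See"
//outerHtml() keeps each element's own tag; html() would return only inner HTML
String sectionHtml = section.outerHtml();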
Here's some Firebug output after doing the same thing with jQuery. The first selector returned an Elements object containing these Elements:
[h3, p, p, p, p, h3, p, p, p, h3, div.thumb, div.thumb, p, ul, p, p, p, p, p, p, p, div.thumb, ul, ul, div.thumb, ul, ul, p, ul, ul, h3, p, p, p, h3, p, p, p, h3, p, p, p, p, p, p, h2, p, p, ul, ul, ul, h3, p, ul, h3, div.thumb, p, p, p, h3, div.thumb, p, p, p, p, p, h3, p, p, p, p, h3, div.thumb, p, p, p, p, p, h2, h3, div.thumb, p, p, p, p, p, ul, h3, div.thumb, ul, ul, ul, ul, ul, h3, p, h4, ul, h4, ul, h4, p, ul, h4, ul, h3, div.thumb, p, p, p, h3, p, h2, p, p, h2, p, p, p, h2, dl, p, p, p, p, h2, div.thumb, dl, p, p, p, h2, dl, h3, p, p, p, p, p, p, h3, p, ul, p, p, h2, dl, p, p, p, h2, p, p, p, p, h2, p, p, p, p, p, p, h2, p, p, p, p, h2, h3, ul, h3, ul, ul, ul, ul, ul, ul, ul, ul, ul, ul, ul, ul, ul, ul, ul, ul, ul, ul, ul, ul, ul, ul, ul, ul, ul, h2, p, p, ul, p, div.route_box, p, p, p, p, p, table, p, div, p, p, p, p]
Where the first h3 says "Navigating" and the last p contains a <br> (weird HTML, yeah). Dropping the next siblings of the "See" h2 pared it down to this:
[h3, p, p, p, p, h3, p, p, p, h3, div.thumb, div.thumb, p, ul, p, p, p, p, p, p, p, div.thumb, ul, ul, div.thumb, ul, ul, p, ul, ul, h3, p, p, p, h3, p, p, p, h3, p, p, p, p, p, p, h2]
Where the first h3 is still the one saying "Navigating" and the last h2 is the "See" one you referenced. Dropping that h2 resulted in this:
[h3, p, p, p, p, h3, p, p, p, h3, div.thumb, div.thumb, p, ul, p, p, p, p, p, p, p, div.thumb, ul, ul, div.thumb, ul, ul, p, ul, ul, h3, p, p, p, h3, p, p, p, h3, p, p, p, p, p, p]
Which contains all of the elements between the "Get around" h2 and the "See" h2.
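If you'd rather not build a long list and then prune it, one alternative is to walk forward from the starting h2 and stop at the next h2. This is only a sketch that I haven't run against the live page. It assumes a reasonably recent jsoup (selectFirst is missing from very old releases), and it finds the heading through the span's id with :has instead of by text. The class name SectionBetweenHeadings, the helper sectionAfter, and the tiny inline HTML are all made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SectionBetweenHeadings {

    // Hypothetical helper: collects the element siblings that follow the h2
    // containing #startId, stopping at the next h2 or the end of the parent.
    static Elements sectionAfter(Document doc, String startId) {
        Elements section = new Elements();
        Element heading = doc.selectFirst("h2:has(#" + startId + ")");
        if (heading == null) {
            return section; // no such heading: return an empty list
        }
        for (Element el = heading.nextElementSibling();
             el != null && !el.tagName().equals("h2");
             el = el.nextElementSibling()) {
            section.add(el);
        }
        return section;
    }

    public static void main(String[] args) {
        // Made-up stand-in for the real page, small enough to check by eye.
        String html =
              "<h2><span class=\"mw-headline\" id=\"Get_around\">Get around</span></h2>"
            + "<h3>Navigating</h3><p>Streets form a grid.</p>"
            + "<h2><span class=\"mw-headline\" id=\"See\">See</span></h2>"
            + "<p>Museums.</p>";
        Document doc = Jsoup.parse(html);

        // Prints the h3 and p that sit between the two h2 headings.
        Elements section = sectionAfter(doc, "Get_around");
        System.out.println(section.outerHtml());
    }
}
```

Against the real page you would swap Jsoup.parse(html) for the Jsoup.connect(...).get() call and keep the rest. The walk stops at the first following h2, so you don't need to know the name of the next section.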
Upvotes: 3