Reputation: 1701
In my Rails app I have HTML like the following, parsed in Nokogiri.
I want to be able to select chunks of HTML. For example, how can I select the block of HTML that's part of <sup id="21"> using XPath or CSS? Assume that in the real HTML the section with ******** does not exist.
I want to split the HTML by <sup id=*> but the problem is that the nodes are siblings.
<sup class="v" id="20">
1
</sup>
this is some random text
<p></p>
more random text
<sup class="footnote" value='fn1'>
[v]
</sup>
# ****************************** starting here
<sup class="v" id="21">
2
</sup>
now this is a different section
<p></p>
how do we keep this separate
<sup class="footnote" value='fn2'>
[x]
</sup>
# ****************************** ending here
<sup class="v" id="23">
3
</sup>
this is yet another different section
<p></p>
how do we keep this separate too
<sup class="footnote" value='fn3'>
[r]
</sup>
Upvotes: 2
Views: 394
Reputation: 303244
Here's a simple solution that gives you NodeSets with all the nodes between <sup … class="v">, hashed by their id.
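require 'nokogiri'
doc = Nokogiri.HTML(your_html)
# Any key seen for the first time gets a fresh, empty NodeSet
nodes_by_vsup_id = Hash.new{ |h,k| h[k]=Nokogiri::XML::NodeSet.new(doc) }
last_id = nil
doc.at('body').children.each do |n|
  # A <sup class="v"> starts a new section
  last_id = n['id'] if n['class']=='v'
  nodes_by_vsup_id[last_id] << n
end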
puts nodes_by_vsup_id['21']
#=> <sup class="v" id="21">
#=> 2
#=> </sup>
#=>
#=> now this is a different section
#=> <p></p>
#=>
#=> how do we keep this separate
#=> <sup class="footnote" value="fn2">
#=> [x]
#=> </sup>
Alternatively, if you don't want the delimiting sup to be part of the collection, do this instead:
# Iterate children (not elements) so the text nodes between the markers are kept
doc.at('body').children.each do |n|
  if n['class']=='v'
    last_id = n['id']
  else
    nodes_by_vsup_id[last_id] << n
  end
end
Here's an alternative, even-more-generic solution:
class Nokogiri::XML::NodeSet
  # Yields each node in the set to your block
  # Returns a hash keyed by whatever your block returns
  # Any nodes that return nil/false are grouped with the previous valid value
  def group_chunks
    Hash.new{ |k,v| k[v] = self.class.new(document) }.tap do |result|
      key = nil
      each{ |n| result[key = yield(n) || key] << n }
    end
  end
end
root_items = doc.at('body').children
separated = root_items.group_chunks{ |node| node['class']=='v' && node['id'] }
puts separated['21']
Upvotes: 1
Reputation: 60414
It looks like you want to select everything between the sup with @id='21' and the sup with @id='23'. Use the following ad-hoc expression:
//sup[@id='21']|(//sup[@id='21']/following-sibling::node()[
not(self::sup[@id='23'] or preceding-sibling::sup[@id='23'])])
Or an application of the Kayessian node-set intersection formula:
//sup[@id='21']|(//sup[@id='21']/following-sibling::node()[
count(.|//sup[@id='23']/preceding-sibling::node())
=
count(//sup[@id='23']/preceding-sibling::node())])
Upvotes: 1
Reputation: 3905
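require 'open-uri'
require 'nokogiri'
# URI.open is the open-uri entry point on current Rubies (Kernel#open no longer accepts URLs)
doc = Nokogiri::HTML(URI.open("http://www.yoururl"))
# Attributes need the @ prefix in XPath; a bare id would match a child element named id
doc.xpath('//sup[@id="21"]').each do |node|
  puts node.text
end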
Upvotes: -1