Reputation: 763
I have HTML like this:
<h1> Header is here</h1>
<h2>Header 2 is here</h2>
<p> Extract me!</p>
<p> Extract me too!</p>
<h2> Next Header 2</h2>
<p>not interested</p>
<p>not interested</p>
<h2>Header 2 is here</h2>
<p> Extract me!</p>
<p> Extract me too!</p>
I have a basic Nokogiri CSS node search returning <p> content, but I can't find examples of how to target all the text between the Nth closing </h2> and the next opening <h2>. I'm creating a CSV from the output, so I would also like to read in a list of files and put the URL as the first value in each row.
Upvotes: 5
Views: 5924
Reputation: 54984
You can sometimes use NodeSet's & operator to get information between nodes:
doc.xpath('//h2[1]/following-sibling::p') & doc.xpath('//h2[2]/preceding-sibling::p')
Upvotes: 3
Reputation: 303234
Instead of an XPath solution, here's a simple (naïve) implementation that assumes that the start and stop elements share the same parent and allows the XPaths for start and stop to be specified independently:
HTML = "<h1>Header is here</h1>
<h2>Header 2 is here</h2>
<p>Extract me!</p>
<p>Extract me too!</p>
<h2> Next Header 2</h2>
<p>not interested</p>
<p>not interested</p>
<h2>Header 2 is here</h2>
<p>Extract me three!</p>
<p>Extract me four!</p>"
require 'nokogiri'
class Nokogiri::XML::Node
  # Naive implementation; assumes found elements will share the same parent
  def content_between( start_xpath, stop_xpath=nil )
    node = at_xpath(start_xpath).next_element
    stop = stop_xpath && at_xpath(stop_xpath)
    [].tap do |content|
      while node && node!=stop
        content << node
        node = node.next_element
      end
    end
  end
end
html = Nokogiri::HTML(HTML)
puts html.content_between('//h2[1]','//h2[2]').map(&:content)
#=> Extract me!
#=> Extract me too!
puts html.content_between('//h2[3]').map(&:content)
#=> Extract me three!
#=> Extract me four!
Upvotes: 2
Reputation: 303234
If the start and stop elements have the same parent, this is as simple as a single XPath. First I'll show it with a simplified document for clarity, and then with your sample document:
XML = "<root>
<a/><a1/><a2/>
<b/><b1/><b2/>
<c/><c1/><c2/>
</root>"
require 'nokogiri'
xml = Nokogiri::XML(XML)
# Find all elements between 'a' and 'c'
p xml.xpath('//*[preceding-sibling::a][following-sibling::c]').map(&:name)
#=> ["a1", "a2", "b", "b1", "b2"]
# Find all elements between 'a' and 'b'
p xml.xpath('//*[preceding-sibling::a][following-sibling::b]').map(&:name)
#=> ["a1", "a2"]
# Find all elements after 'c'
p xml.xpath('//*[preceding-sibling::c]').map(&:name)
#=> ["c1", "c2"]
Now, here it is with your use case (finding by index):
HTML = "<h1> Header is here</h1>
<h2>Header 2 is here</h2>
<p>Extract me!</p>
<p>Extract me too!</p>
<h2> Next Header 2</h2>
<p>not interested</p>
<p>not interested</p>
<h2>Header 2 is here</h2>
<p>Extract me three!</p>
<p>Extract me four!</p>"
require 'nokogiri'
html = Nokogiri::HTML(HTML)
# Find all elements between the first and second h2s
# (this works here because the sample has exactly three h2s)
p html.xpath('//*[preceding-sibling::h2[1]][following-sibling::h2[2]]').map(&:content)
#=> ["Extract me!", "Extract me too!"]
# Find all elements between the third h2 and the end
p html.xpath('//*[preceding-sibling::h2[3]]').map(&:content)
#=> ["Extract me three!", "Extract me four!"]
Upvotes: 2
Reputation: 757
require 'rubygems'
require 'nokogiri'
h = '<h1> Header is here</h1>
<h2>Header 2 is here</h2>
<p> Extract me!</p>
<p> Extract me too!</p>
<h2> Next Header 2</h2>
<p>not interested</p>
<p>not interested</p>
<h2>Header 2 is here</h2>
<p> Extract me!</p>
<p> Extract me too!</p>
'
doc = Nokogiri::HTML(h)
# Specify the ranges of delimiter counts whose following content you want to extract.
# The counter includes every delimiter tag: here the h1 is 1, the first h2 is 2, and so on.
# Triple dot excludes the end point: 2...3 covers 2 but not 3.
EXTRACT_RANGES = [
  2...3,
  4...5
]
# Tags which count as delimiters, not to be extracted
DELIMITER_TAGS = [
  "h1",
  "h2"
]
extracted_text = []
i = 0
# Change /"html"/"body" to the correct path of the tag which contains this list
(doc/"html"/"body").children.each do |el|
  if DELIMITER_TAGS.include?(el.name)
    i += 1
  elsif EXTRACT_RANGES.any? { |cur_range| cur_range.include?(i) }
    s = el.inner_text.strip
    extracted_text << s unless s.empty?
  end
end
# Print out extracted text (each element's inner text is separated by newlines)
puts extracted_text.join("\n")
Upvotes: 4
Reputation:
This code may help you, but it still needs more information about where the tags are located (it would be better if the content you need to extract were wrapped in some containing tags):
require 'rubygems'
require 'nokogiri'
require 'pp'
html = '<h1> Header is here</h1>
<h2>Header 2 is here</h2>
<p> Extract me!</p>
<p> Extract me too!</p>
<h2> Next Header 2</h2>
<p>not interested</p>
<p>not interested</p>
<h2>Header 2 is here</h2>
<p> Extract me!</p>
<p> Extract me too!</p>
'
doc = Nokogiri::HTML(html)
# Print every <p> element in the document
doc.xpath("//p").each do |el|
  pp el
end
Upvotes: 1