Reputation: 1
I want to scrape an HTML file like this:
<div id="hoge">
<h1><span>title 1</span></h1>
<h2><span>subtitle 1-1</span></h2>
<p></p>
<table class="fuga"><span>data 1-1</span></table>
<p></p>
//(the same structure repeated n times)
<h2><span>subtitle 1-(n+2)<span/></h2>
<p></p>
<table class="fuga"><span>data 1-(n+2)</span></table>
<p></p>
//(the same structure repeated m times)
<h1><span>title m</span></h1>
<h2><span>subtitle m-1</span></h2>
<p></p>
<table class="fuga"><span>data m-1</span></table>
<p></p>
//(the same structure repeated l times)
<h2><span>subtitle m-(l+2)</span></h2>
<p></p>
<table class="fuga"><span>data m-(l+2)</span></table>
<p></p>
</div>
I need the table values (represented in the example as "data x-y") for each subtitle ("subtitle x-y") under each title ("title x").
To associate them, I want to cut out everything from each <h1> through the last <p> before the next <h1>, but I can't figure out how to do it.
I spent 5 hours searching, reading, and experimenting by trial and error, and finally wrote the code below, but it still doesn't work.
What's wrong? How can I cut the HTML?
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open("http://example.com/"))
doc.xpath('//div[@id="mw-content-text"]').each do |node|
  for i in 1..node.xpath('h1').length do
    mininode = node.xpath(%(node()[not(following-sibling::h1[#{i}] or preceding-sibling::h1[#{i+1}])]))
    title = mininode.xpath('h1/span').text
    puts title unless title.empty?
    puts "============"
    for j in 1..mininode.xpath('h2').length do
      puts mininode.xpath(%(h2[#{j}]/span)).text
      puts mininode.xpath(%(table[#{j}]/span)).text
    end
  end
end
Upvotes: 0
Views: 218
Reputation: 160551
Meditate on this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div id="hoge">
<h1><span>title 1</span></h1>
<h2><span>subtitle 1-1</span></h2>
<p></p>
<table class="fuga"><span>data 1-1</span></table>
<p></p>
//(the same structure repeated n times)
<h2><span>subtitle 1-(n+2)<span/></h2>
<p></p>
<table class="fuga"><span>data 1-(n+2)</span></table>
<p></p>
//(the same structure repeated m times)
<h1><span>title m</span></h1>
<h2><span>subtitle m-1</span></h2>
<p></p>
<table class="fuga"><span>data m-1</span></table>
<p></p>
//(the same structure repeated l times)
<h2><span>subtitle m-(l+2)</span></h2>
<p></p>
<table class="fuga"><span>data m-(l+2)</span></table>
<p></p>
</div>
EOT
Process the doc:
div = doc.at('#hoge')
h1_blocks = div.children.slice_before{ |node| node.name == 'h1' }.map{ |nodes| Nokogiri::XML::NodeSet.new(doc, nodes) }
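Running that results in h1_blocks containing an array of NodeSets. The first element, h1_blocks[0], holds only the whitespace text node that precedes the first <h1>, so the real blocks start at index 1. Here's the first set based on your HTML: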
h1_blocks[1].map(&:to_html)
# => ["<h1><span>title 1</span></h1>",
# "\n\n ",
# "<h2><span>subtitle 1-1</span></h2>",
# "\n ",
# "<p></p>",
# "\n ",
# "<table class=\"fuga\"><span>data 1-1</span></table>",
# "\n ",
# "<p></p>",
# "\n\n //(the same structure repeated n times)\n\n ",
# "<h2><span>subtitle 1-(n+2)<span></span></span></h2>",
# "\n ",
# "<p></p>",
# "\n ",
# "<table class=\"fuga\"><span>data 1-(n+2)</span></table>",
# "\n ",
# "<p></p>",
# "\n\n\n //(the same structure repeated m times)\n\n "]
Here's the second set, based on your HTML:
h1_blocks[2].map(&:to_html)
# => ["<h1><span>title m</span></h1>",
# "\n\n ",
# "<h2><span>subtitle m-1</span></h2>",
# "\n ",
# "<p></p>",
# "\n ",
# "<table class=\"fuga\"><span>data m-1</span></table>",
# "\n ",
# "<p></p>",
# "\n\n //(the same structure repeated l times)\n\n ",
# "<h2><span>subtitle m-(l+2)</span></h2>",
# "\n ",
# "<p></p>",
# "\n ",
# "<table class=\"fuga\"><span>data m-(l+2)</span></table>",
# "\n ",
# "<p></p>",
# "\n\n\n"]
How does this work?
Ruby's Enumerable module has slice_before, which evaluates the block for each element and starts a new sub-array at every element for which the block returns true. This is useful when we have a flat list of elements that we need to break into separate chunks.
Often we use it when parsing text that has some sort of repeating blocks that we need to process as a chunk, such as paragraphs, network-device interfaces, etc.
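As a minimal sketch of that idea using only plain arrays, the example below chunks some made-up interface lines (the data and variable names are hypothetical, not from the question). It also shows that any elements before the first match end up in their own leading chunk, which is why the first real block above sits at index 1.

```ruby
# Hypothetical sample data: each block starts with an "interface" line.
lines = [
  'interface eth0',
  '  ip 10.0.0.1',
  '  mtu 1500',
  'interface eth1',
  '  ip 10.0.0.2'
]

# A new chunk begins at every line for which the block returns true.
blocks = lines.slice_before { |line| line.start_with?('interface') }.to_a
p blocks
# => [["interface eth0", "  ip 10.0.0.1", "  mtu 1500"],
#     ["interface eth1", "  ip 10.0.0.2"]]

# Elements before the first match form their own leading chunk.
with_leading = [0, 1, 2, 1, 3].slice_before { |n| n == 1 }.to_a
p with_leading
# => [[0], [1, 2], [1, 3]]
```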
The children of the <div id="hoge"> tag are chunked this way, and each chunk is then passed into map, which turns it back into a NodeSet, making it easy to keep working with the nodes as we normally would in Nokogiri.
Upvotes: 1