yuno

Reputation: 1

How to get the HTML between pairs of the same tag using Nokogiri?

I want to scrape an HTML file like this:

<div id="hoge">
  <h1><span>title 1</span></h1>

    <h2><span>subtitle 1-1</span></h2>
    <p></p>
    <table class="fuga"><span>data 1-1</span></table>
    <p></p>

    //(the same structure repeated n times)

    <h2><span>subtitle 1-(n+2)<span/></h2>
    <p></p>
    <table class="fuga"><span>data 1-(n+2)</span></table>
    <p></p>


  //(the same structure repeated m times)

  <h1><span>title m</span></h1>

    <h2><span>subtitle m-1</span></h2>
    <p></p>
    <table class="fuga"><span>data m-1</span></table>
    <p></p>

    //(the same structure repeated l times)

    <h2><span>subtitle m-(l+2)</span></h2>
    <p></p>
    <table class="fuga"><span>data m-(l+2)</span></table>
    <p></p>


</div>

I need the table values (shown as "data x-y" in the example) for each subtitle ("subtitle x-y") under each title ("title x").
To associate them, I want to cut out everything from each <h1> to the last <p> before the next <h1>, but I can't figure out how to do it.
I spent 5 hours searching, reading, and experimenting, and finally wrote the code below, but it still doesn't work.
What's wrong? How can I cut the HTML?

require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(open("http://example.com/"))

doc.xpath('//div[@id="mw-content-text"]').each do |node|
  for i in 1..node.xpath('h1').length do
    mininode = node.xpath(%(node()[not(following-sibling::h1[#{i}] or preceding-sibling::h1[#{i+1}])]))

    title = mininode.xpath('h1/span').text
    puts title unless title.empty?
    puts "============"

    for j in 1..mininode.xpath('h2').length do
      puts mininode.xpath(%(h2[#{j}]/span)).text
      puts mininode.xpath(%(table[#{j}]/span)).text
    end
  end
end

Upvotes: 0

Views: 218

Answers (1)

the Tin Man

Reputation: 160551

Meditate on this:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<div id="hoge">
  <h1><span>title 1</span></h1>

    <h2><span>subtitle 1-1</span></h2>
    <p></p>
    <table class="fuga"><span>data 1-1</span></table>
    <p></p>

    //(the same structure repeated n times)

    <h2><span>subtitle 1-(n+2)<span/></h2>
    <p></p>
    <table class="fuga"><span>data 1-(n+2)</span></table>
    <p></p>


  //(the same structure repeated m times)

  <h1><span>title m</span></h1>

    <h2><span>subtitle m-1</span></h2>
    <p></p>
    <table class="fuga"><span>data m-1</span></table>
    <p></p>

    //(the same structure repeated l times)

    <h2><span>subtitle m-(l+2)</span></h2>
    <p></p>
    <table class="fuga"><span>data m-(l+2)</span></table>
    <p></p>


</div>
EOT

Process the doc:

div = doc.at('#hoge')
# Start a new chunk before each <h1>, then wrap each chunk in a NodeSet.
h1_blocks = div.children
  .slice_before { |node| node.name == 'h1' }
  .map { |nodes| Nokogiri::XML::NodeSet.new(doc, nodes) }

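Running that results in h1_blocks containing an array of NodeSets. The <div> starts with a whitespace text node ahead of the first <h1>, so h1_blocks[0] holds only that node and the first <h1> block is h1_blocks[1]. Here's that first block, based on your HTML: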

h1_blocks[1].map(&:to_html)
# => ["<h1><span>title 1</span></h1>",
#     "\n\n    ",
#     "<h2><span>subtitle 1-1</span></h2>",
#     "\n    ",
#     "<p></p>",
#     "\n    ",
#     "<table class=\"fuga\"><span>data 1-1</span></table>",
#     "\n    ",
#     "<p></p>",
#     "\n\n    //(the same structure repeated n times)\n\n    ",
#     "<h2><span>subtitle 1-(n+2)<span></span></span></h2>",
#     "\n    ",
#     "<p></p>",
#     "\n    ",
#     "<table class=\"fuga\"><span>data 1-(n+2)</span></table>",
#     "\n    ",
#     "<p></p>",
#     "\n\n\n  //(the same structure repeated m times)\n\n  "]

Here's the second set, based on your HTML:

h1_blocks[2].map(&:to_html)
# => ["<h1><span>title m</span></h1>",
#     "\n\n    ",
#     "<h2><span>subtitle m-1</span></h2>",
#     "\n    ",
#     "<p></p>",
#     "\n    ",
#     "<table class=\"fuga\"><span>data m-1</span></table>",
#     "\n    ",
#     "<p></p>",
#     "\n\n    //(the same structure repeated l times)\n\n    ",
#     "<h2><span>subtitle m-(l+2)</span></h2>",
#     "\n    ",
#     "<p></p>",
#     "\n    ",
#     "<table class=\"fuga\"><span>data m-(l+2)</span></table>",
#     "\n    ",
#     "<p></p>",
#     "\n\n\n"]

How does this work?

Ruby's Enumerable module has slice_before, which calls the block for each element and starts a new sub-array before every element for which the block returns true. Anything that comes before the first match ends up in a chunk of its own. This is useful when we have a flat list of elements and need to break it into separate chunks.

Often we use it when parsing text that has some sort of repeating blocks that we need to process as a chunk, such as paragraphs, network-device interfaces, etc.
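As a minimal sketch of that kind of text chunking, here is slice_before applied to a plain array of strings. The config lines are made up for illustration; only slice_before itself comes from Ruby's standard library.

```ruby
# Hypothetical device config; the lines are invented for this example.
lines = [
  "! generated config",
  "interface eth0",
  "  ip 10.0.0.1",
  "  mtu 1500",
  "interface eth1",
  "  ip 10.0.0.2"
]

# A new chunk starts before every line for which the block returns true.
# slice_before returns an Enumerator, so to_a materializes the chunks.
blocks = lines.slice_before { |line| line.start_with?("interface") }.to_a

blocks.each { |block| p block }
# ["! generated config"]
# ["interface eth0", "  ip 10.0.0.1", "  mtu 1500"]
# ["interface eth1", "  ip 10.0.0.2"]
```

Note that the line before the first "interface" lands in its own leading chunk, which is the same reason the first h1_blocks entry holds only whitespace.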

The children of the <div id="hoge"> tag are chunked this way, then each chunk is passed into map, which turns it back into a NodeSet, making it easy to continue treating the chunks like we would normally in Nokogiri.

Upvotes: 1
