mcchots

Reputation: 516

Parse/iterate an HTML file with Hpricot/Nokogiri

I'm trying to parse an HTML file with the following format at the required section:

    <div style="something">
      <div class="link">
        <a href="http://..." class="headline">Headline</a>
      </div>
      <div class="text">
        Text summary is here
      </div>
      repeating...
    </div>

I want to output the headline followed by the text.

    HEADLINE
    Text goes here.

    HEADLINE
    Text goes here.

Currently I can search for the <a> tags with class="headline" to get one list, do the same for the text divs to get a second list, and then iterate through both lists to output each headline and its text in sequence.

Can I get Hpricot/Nokogiri to save it in that order while it is parsing the file?

Upvotes: 0

Views: 744

Answers (1)

Mark Thomas

Reputation: 37507

Sure.

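    require 'nokogiri'

    # `html` holds the page source as a String
    doc = Nokogiri::HTML(html)
    doc.xpath('//a[@class="headline"]').each do |headline|
      puts headline.text
      # Step up to the enclosing div.link, then take the next sibling div (the text)
      puts headline.xpath('../following-sibling::div[1]').text.strip
    end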

Upvotes: 3
