Reputation: 1867
i have HTML like this:
<h1>Hello1</h1>
<p>World1</p>
<h1>Hello2</h1>
<p>World2</p>
<h1>Hello2</h1>
<p>World2</p>
So i need to get at the one time Hello1 with World1, Hello2 with World2 etc
UPDATE: I use Ruby Mechanize library
Upvotes: 2
Views: 1398
Reputation: 3861
The Ruby library "Mechanize" uses the Nokogiri parsing library, so you can call Nokogiri directly. One potential solution might look something like this:
require 'mechanize'
require 'pp'
html = "<h1>Hello1</h1>
<p>World1</p>
<h1>Hello2</h1>
<p>World2</p>
<h1>Hello2</h1>
<p>World2</p>"
results = []
Nokogiri::HTML(html).xpath("//h1").each do |header|
p = header.xpath("following-sibling::p[1]").text
results << [header.text, p]
end
pp results
EDIT: This example was tested with Mechanize v2.0.1 which uses Nokogiri ~v1.4. I also tested directly against Nokogiri v1.5.0 without issue.
EDIT #2: This example answers a follow-up question to the original solution:
require 'nokogiri'
require 'pp'
html = <<HTML
<h1>
<p>
<font size="4">
<b>abide by (something)</b>
</font>
</p>
</h1>
<p>
<font size="3">- to follow the rules of something</font>
</p>
The cleaning staff must abide by the rules of the school.
<br>
<h1>
<p>
<font size="4">
<b>able to breathe easily again</b>
</font>
</p>
</h1>
<p>
My friend was able to breathe easily again when his company did not go bankrupt.
<br>
HTML
doc = Nokogiri::HTML(html)
results = []
Nokogiri::HTML(html).xpath("//h1").each do |header|
h1 = header.xpath("following-sibling::p/font/b").text
results << h1
end
pp results
H1
tags with nested elements are invalid, so Nokogiri corrects the error during the parsing process. The process to get at the formerly nested elements is very similar to the original solution.
Upvotes: 3
Reputation: 25465
Note: I glazed over the XPath part of this request. This answer is for an XSLT style sheet instead.
Expanding your XML example to give it a root element:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<h1>Hello1</h1>
<p>World1</p>
<h1>Hello2</h1>
<p>World2</p>
<h1>Hello3</h1>
<p>World3</p>
</root>
You could use a for-each loop along with "following-sibling" to get the elements with something like this:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output encoding="UTF-8" method="text"/>
<xsl:template match="/">
<!-- start lookint for <h1> nodes -->
<xsl:for-each select="/root/h1">
<!-- output the h1 text -->
<xsl:value-of select="."/>
<!-- print a dash for spacing -->
<xsl:text> - </xsl:text>
<!-- select the next <p> node -->
<xsl:value-of select="following-sibling::p[1]"/>
<!-- print a new line -->
<xsl:text> </xsl:text>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
The output would look like this:
Hello1 - World1
Hello2 - World2
Hello3 - World3
Upvotes: 1