VitalyP

Reputation: 1867

XPath: how to get text from this and next tag?

I have HTML like this:

<h1>Hello1</h1>
<p>World1</p>
<h1>Hello2</h1>
<p>World2</p>
<h1>Hello2</h1>
<p>World2</p>

I need to get each pair together: Hello1 with World1, Hello2 with World2, and so on.

UPDATE: I am using the Ruby Mechanize library.

Upvotes: 2

Views: 1398

Answers (2)

ezkl

Reputation: 3861

The Ruby library "Mechanize" uses the Nokogiri parsing library, so you can call Nokogiri directly. One potential solution might look something like this:

require 'mechanize'
require 'pp'

html = "<h1>Hello1</h1>
<p>World1</p>
<h1>Hello2</h1>
<p>World2</p>
<h1>Hello2</h1>
<p>World2</p>"

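results = []

# Mechanize loads Nokogiri, so Nokogiri::HTML is available here.
Nokogiri::HTML(html).xpath("//h1").each do |header|
  # [1] selects the nearest following <p> sibling of this <h1>.
  # The variable is not named "p" so it does not shadow Kernel#p.
  paragraph = header.xpath("following-sibling::p[1]").text
  results << [header.text, paragraph]
end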

pp results

EDIT: This example was tested with Mechanize v2.0.1 which uses Nokogiri ~v1.4. I also tested directly against Nokogiri v1.5.0 without issue.

EDIT #2: This example answers a follow-up question to the original solution:

require 'nokogiri'
require 'pp'

html = <<HTML
<h1>
<p>
<font size="4">
<b>abide by (something)</b>
</font>
</p>
</h1>
<p>
<font size="3">- to follow the rules of something</font>
</p>
The cleaning staff must abide by the rules of the school.
<br>
<h1>
<p>
<font size="4">
<b>able to breathe easily again</b>
</font>
</p>
</h1>
<p>
My friend was able to breathe easily again when his company did not go bankrupt.
<br>
HTML

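doc = Nokogiri::HTML(html)

results = []

# Reuse the parsed document instead of parsing the HTML a second time.
doc.xpath("//h1").each do |header|
  # [1] limits the match to the nearest following <p>; without it,
  # the <b> text of later <p> siblings is concatenated into the result.
  title = header.xpath("following-sibling::p[1]/font/b").text
  results << title
end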

pp results

An <h1> may not contain block-level elements such as <p>, so Nokogiri repairs the markup while parsing, and the formerly nested elements end up as siblings following the <h1>. Reaching them therefore works much like the original solution.

Upvotes: 3

Alan W. Smith

Reputation: 25465

Note: I glossed over the XPath part of this request. This answer uses an XSLT stylesheet instead.

Expanding your XML example to give it a root element:

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <h1>Hello1</h1>
    <p>World1</p>
    <h1>Hello2</h1>
    <p>World2</p>
    <h1>Hello3</h1>
    <p>World3</p>
</root>

You could use a for-each loop along with "following-sibling" to get the elements with something like this:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

    <xsl:output encoding="UTF-8" method="text"/>

    <xsl:template match="/">

        <!-- start looking for <h1> nodes -->
        <xsl:for-each select="/root/h1">

            <!-- output the h1 text -->
            <xsl:value-of select="."/>

            <!-- print a dash for spacing -->
            <xsl:text> - </xsl:text>

            <!-- select the next <p> node -->
            <xsl:value-of select="following-sibling::p[1]"/>

            <!-- print a new line -->
            <xsl:text>&#10;</xsl:text>

        </xsl:for-each>

    </xsl:template>

</xsl:stylesheet>

The output would look like this:

Hello1 - World1
Hello2 - World2
Hello3 - World3
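Since the question mentions Ruby, the same "following-sibling::p[1]" expression can also be evaluated from Ruby against the root-wrapped XML. This is only a rough sketch that assumes REXML (shipped with standard Ruby installations) in place of an XSLT processor; the variable names are illustrative.

```ruby
require 'rexml/document'

# The root-wrapped XML from this answer.
xml = <<~XML
  <root>
    <h1>Hello1</h1>
    <p>World1</p>
    <h1>Hello2</h1>
    <p>World2</p>
    <h1>Hello3</h1>
    <p>World3</p>
  </root>
XML

doc = REXML::Document.new(xml)

lines = REXML::XPath.match(doc, '/root/h1').map do |header|
  # Nearest following <p> sibling of the current <h1>.
  para = REXML::XPath.first(header, 'following-sibling::p[1]')
  "#{header.text} - #{para.text}"
end

puts lines
```

This should yield the same "heading - paragraph" lines as the stylesheet, one per h1.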

Upvotes: 1
