Reputation: 609

XPATH - stop scraping after a certain html element

I am using this XPath query to try to grab the first three items, the ones under the "ASQ Package Price" heading:

//h2[contains(., 'ASQ Package Price')]/following-sibling::p

But it also grabs the other 3 items, so I end up with

Example 1 Example 2 Example 3 Example 4 Example 5 Example 6

I only want:

Example 1 Example 2 Example 3

How do I prevent XPath from selecting the three I don't want? It seems that in this case it needs to stop at the <hr> tag.

<div itemprop="articleBody">

<h2>ASQ Package Price</h2>
<p class="">Example 1</p>
<p class="">Example 2</p>
<p class="">Example 3</p>

<hr>

<h2>ASQ Package Features&nbsp;</h2>

<p class="">Example 4</p>
<p class="">Example 5</p>
<p class="">Example 6</p>

</div>

Upvotes: 1

Answers (2)

Dimitre Novatchev

Reputation: 243599

Use:

     (//h2[starts-with(., 'ASQ Package')])[1]/following-sibling::hr[1]
                                                         /preceding-sibling::p

Verification with XSLT:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

  <xsl:template match="/">
     <xsl:copy-of select=
     "(//h2[starts-with(., 'ASQ Package')])[1]
                    /following-sibling::hr[1]
                        /preceding-sibling::p"/>
  </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the provided HTML (adjusted to be well-formed XHTML):

<html>
    <div itemprop="articleBody">
        <h2>ASQ Package Price</h2>
        <p class="">Example 1</p>
        <p class="">Example 2</p>
        <p class="">Example 3</p>
        <hr />
        <h2>ASQ Package Features&#160;</h2>
        <p class="">Example 4</p>
        <p class="">Example 5</p>
        <p class="">Example 6</p>
    </div>
</html>

the XPath expression is evaluated, and all the nodes it selects are output:

<p class="">Example 1</p>
<p class="">Example 2</p>
<p class="">Example 3</p>

Explanation:

We need the preceding-sibling <p> elements of the first <hr> that is a following sibling of the first <h2> in the document whose string value starts with "ASQ Package".

The first such <h2> element is selected by this XPath expression:

(//h2[starts-with(., 'ASQ Package')])[1]

Then we select its first following sibling <hr>:

    (//h2[starts-with(., 'ASQ Package')])[1]/following-sibling::hr[1]

Then we select all its preceding-sibling <p> elements:

 (//h2[starts-with(., 'ASQ Package')])[1]/following-sibling::hr[1]
                                                     /preceding-sibling::p
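
The three steps above can also be followed procedurally. This is only a sketch, written with Python's standard-library xml.etree.ElementTree, whose XPath subset does not include the following-sibling or preceding-sibling axes, so the siblings are walked by hand. The helper name paragraphs_before_first_hr is hypothetical, and the input is the XHTML from the verification above.

```python
import xml.etree.ElementTree as ET

XHTML = """<html>
    <div itemprop="articleBody">
        <h2>ASQ Package Price</h2>
        <p class="">Example 1</p>
        <p class="">Example 2</p>
        <p class="">Example 3</p>
        <hr />
        <h2>ASQ Package Features&#160;</h2>
        <p class="">Example 4</p>
        <p class="">Example 5</p>
        <p class="">Example 6</p>
    </div>
</html>"""


def paragraphs_before_first_hr(root, heading_prefix):
    """Hypothetical helper mimicking
    (//h2[starts-with(., PREFIX)])[1]/following-sibling::hr[1]/preceding-sibling::p
    """
    # ElementTree elements have no parent pointer, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    # Step 1: the first <h2> in document order whose string value starts with the prefix.
    heading = next(
        (h for h in root.iter("h2")
         if "".join(h.itertext()).startswith(heading_prefix)),
        None,
    )
    if heading is None or heading not in parent_of:
        return []

    siblings = list(parent_of[heading])
    start = siblings.index(heading)

    # Step 2: the first <hr> among the siblings that follow the heading.
    for pos in range(start + 1, len(siblings)):
        if siblings[pos].tag == "hr":
            # Step 3: every <p> sibling that precedes that <hr>.
            return [s for s in siblings[:pos] if s.tag == "p"]
    return []  # no <hr> after the heading, so nothing is selected


root = ET.fromstring(XHTML)
print([p.text for p in paragraphs_before_first_hr(root, "ASQ Package")])
# prints ['Example 1', 'Example 2', 'Example 3']
```

Calling the helper with the longer prefix "ASQ Package Features" returns an empty list for this document, because the heading it matches has no <hr> after it.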

Upvotes: 2

Jack Fleeting

Reputation: 24940

Using XPath 2.0:

//h2/following-sibling::p intersect //hr/preceding-sibling::p

Using XPath 1.0:

//h2/following-sibling::p[not(preceding-sibling::hr)]
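
To see what each of the two expressions selects, here is a rough sketch in plain Python that rebuilds them as list filters. It assumes the question's markup made well-formed, uses only the standard-library xml.etree.ElementTree (which lacks the sibling axes), and the helper name sibling_paragraphs is hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML = """<div itemprop="articleBody">
    <h2>ASQ Package Price</h2>
    <p class="">Example 1</p>
    <p class="">Example 2</p>
    <p class="">Example 3</p>
    <hr />
    <h2>ASQ Package Features&#160;</h2>
    <p class="">Example 4</p>
    <p class="">Example 5</p>
    <p class="">Example 6</p>
</div>"""


def sibling_paragraphs(root):
    """Hypothetical helper: for every <p>, yield the element together with
    the tag names of its preceding and of its following siblings."""
    for parent in root.iter():
        children = list(parent)
        for i, child in enumerate(children):
            if child.tag == "p":
                earlier = {c.tag for c in children[:i]}
                later = {c.tag for c in children[i + 1:]}
                yield child, earlier, later


rows = list(sibling_paragraphs(ET.fromstring(XHTML)))

# //h2/following-sibling::p  -> every <p> with an <h2> somewhere before it
after_h2 = [p for p, earlier, _ in rows if "h2" in earlier]
# //hr/preceding-sibling::p  -> every <p> with an <hr> somewhere after it
before_hr = [p for p, _, later in rows if "hr" in later]
# XPath 2.0 "intersect": nodes present in both sets
xpath2 = [p for p in after_h2 if any(p is q for q in before_hr)]

# XPath 1.0: //h2/following-sibling::p[not(preceding-sibling::hr)]
xpath1 = [p for p, earlier, _ in rows
          if "h2" in earlier and "hr" not in earlier]

print([p.text for p in xpath2])  # prints ['Example 1', 'Example 2', 'Example 3']
print([p.text for p in xpath1])  # prints ['Example 1', 'Example 2', 'Example 3']
```

The two expressions agree on this document, but they are not equivalent in general: with more than one <hr> among the siblings, the intersect form also keeps paragraphs that sit between two <hr> elements, while the XPath 1.0 form stops at the first one.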

Upvotes: 0
