CCID

Reputation: 1448

Xpath to extract text between specific div tag and next div

I want to extract the text in the <p> elements between the div containing 'Heading1' and the next div, as in the example below. I can't use 'Heading2' to identify the next div, because that text may change.

library(XML)
# create example html
html <- '
<div class="AAA">
<div class="AAA">Heading1</div>
</div>
<p>text1 I want</p>
<p>text2 I want</p>
<p>text3 I want</p>
<div class="AAA">
<div class="AAA">Heading2</div> <!-- Do not always know what this heading is -->
</div>
<p>more text</p>
<p>more text</p>
<p>more text</p>
<div class="AAA">
<div class="AAA">Heading3</div>
</div>'

doc <- htmlParse(html)

xpath <- "//p[preceding::div[@class='AAA' and contains(., 'Heading1')]]"

xpathSApply(doc, xpath, xmlValue)

This works as a starting point, but it also returns every <p> that follows the later divs, and I'm stuck on limiting the XPath so it stops at the next div. I tried the following, thinking it would pick out the next div:

"//p[preceding::div[@class='AAA' and contains(., 'Heading1')]and following::div[position()=1]]"

Upvotes: 0

Views: 259

Answers (2)

Daniel Haley

Reputation: 52878

I don't think it's necessary to test the next div. Instead, check that the nearest preceding sibling div of each <p> is the 'Heading1' one. You should be able to do something like this...

//p[preceding-sibling::div[1][normalize-space()='Heading1']]

or this if the class matters...

//p[preceding-sibling::div[1][@class='AAA'][normalize-space()='Heading1']]

or this if you need to still use contains()...

//p[preceding-sibling::div[1][@class='AAA'][contains(normalize-space(),'Heading1')]]
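As a quick check against the question's example, here is a minimal sketch that uses the same `XML` package calls as the question (`htmlParse`, `xpathSApply`, `xmlValue`). The shortened `html` string and the variable name `wanted` are only for illustration. The key point is that `preceding-sibling` is a reverse axis, so `div[1]` means the nearest preceding sibling div rather than the first one in the document.

```r
library(XML)

# Trimmed copy of the example document from the question
html <- '
<div class="AAA">
<div class="AAA">Heading1</div>
</div>
<p>text1 I want</p>
<p>text2 I want</p>
<div class="AAA">
<div class="AAA">Heading2</div>
</div>
<p>more text</p>'

doc <- htmlParse(html, asText = TRUE)

# Keep only <p> elements whose nearest preceding sibling <div>
# has the (whitespace-normalized) text "Heading1"
xpath <- "//p[preceding-sibling::div[1][normalize-space()='Heading1']]"
wanted <- xpathSApply(doc, xpath, xmlValue)
print(wanted)
```

With this input, only the two paragraphs under Heading1 should come back, because the nearest preceding sibling div of "more text" is the Heading2 block.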

Upvotes: 2

JaSON

Reputation: 4869

Try this one

//p[preceding-sibling::div[div="Heading1"] and count(preceding-sibling::div[div])=1]
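One caveat worth checking before relying on this: the `count(...)=1` test only matches paragraphs that have exactly one preceding sibling heading block, so it appears to work only when the target heading is the first block among its siblings. The sketch below is an illustration under that assumption, using a shortened copy of the question's HTML. The helper `count_xpath` is hypothetical and not part of the `XML` package.

```r
library(XML)

# Trimmed copy of the example document from the question
html <- '
<div class="AAA">
<div class="AAA">Heading1</div>
</div>
<p>text1 I want</p>
<div class="AAA">
<div class="AAA">Heading2</div>
</div>
<p>more text</p>'

doc <- htmlParse(html, asText = TRUE)

# Hypothetical helper: build the count-based XPath for a given heading text
count_xpath <- function(heading) {
  sprintf(
    "//p[preceding-sibling::div[div='%s'] and count(preceding-sibling::div[div])=1]",
    heading
  )
}

# Heading1 is the first block, so its paragraph is matched
first_block <- xpathSApply(doc, count_xpath("Heading1"), xmlValue)

# "more text" has two preceding heading blocks, so the count test fails
second_block <- xpathSApply(doc, count_xpath("Heading2"), xmlValue)

print(first_block)
print(length(second_block))
```

If the heading of interest can appear later in the page, a test on the nearest preceding sibling div (`preceding-sibling::div[1]`) is likely the safer choice.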

Upvotes: -1
