Reputation: 1448
I want to extract the text in the <p> elements between the div tag 'Heading1' and the next div tag, in the example below. I can't use 'Heading2' to isolate the next div, as this text may change.
library(XML)
# create example html
html <- '
<div class="AAA">
<div class="AAA">Heading1</div>
</div>
<p>text1 I want</p>
<p>text2 I want</p>
<p>text3 I want</p>
<div class="AAA">
<div class="AAA">Heading2</div> <!-- Do not always know what this heading is -->
</div>
<p>more text</p>
<p>more text</p>
<p>more text</p>
<div class="AAA">
<div class="AAA">Heading3</div>
</div>'
doc <- htmlParse(html)
xpath <- "//p[preceding::div[@class='AAA' and contains(., 'Heading1')]]"
xpathSApply(doc, xpath, xmlValue)
This works up to here, but I'm stuck on limiting the XPath at the next div. I tried the following, thinking it would stop at the next div:
"//p[preceding::div[@class='AAA' and contains(., 'Heading1')]and following::div[position()=1]]"
Upvotes: 0
Views: 259
Reputation: 52878
I don't think it's necessary to test the next div. You should be able to do something like this...
//p[preceding-sibling::div[1][normalize-space()='Heading1']]
or this if the class matters...
//p[preceding-sibling::div[1][@class='AAA'][normalize-space()='Heading1']]
or this if you still need to use contains()...
//p[preceding-sibling::div[1][@class='AAA'][contains(normalize-space(),'Heading1')]]
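As a quick check, here is a minimal sketch that runs the first expression against a trimmed copy of the question's HTML using the XML package. The variable names and the explicit asText = TRUE are my own additions, not something from the question. The idea is that preceding-sibling::div[1] picks only the nearest preceding sibling div of each p, so paragraphs after 'Heading2' no longer match. The original attempt, by contrast, only tests that some following div exists, which is true for every p here.

```r
library(XML)

# Trimmed copy of the sample HTML from the question
html <- '
<div class="AAA">
<div class="AAA">Heading1</div>
</div>
<p>text1 I want</p>
<p>text2 I want</p>
<p>text3 I want</p>
<div class="AAA">
<div class="AAA">Heading2</div>
</div>
<p>more text</p>
<p>more text</p>
<p>more text</p>
<div class="AAA">
<div class="AAA">Heading3</div>
</div>'

# asText = TRUE makes it explicit that html is content, not a file name
doc <- htmlParse(html, asText = TRUE)

# Keep only <p> whose nearest preceding sibling div is the Heading1 block
xpath_nearest <- "//p[preceding-sibling::div[1][normalize-space()='Heading1']]"
wanted <- xpathSApply(doc, xpath_nearest, xmlValue)
print(wanted)  # the three paragraphs under Heading1

# The attempt from the question: following::div[position()=1] is true
# whenever any div follows, so it does not restrict the result
xpath_attempt <- "//p[preceding::div[@class='AAA' and contains(., 'Heading1')] and following::div[position()=1]]"
all_p <- xpathSApply(doc, xpath_attempt, xmlValue)
print(length(all_p))  # all six paragraphs still match
```

Because the test is relative to each p, the same pattern should work for any heading by swapping the text in normalize-space()='...'.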
Upvotes: 2
Reputation: 4869
Try this one:
//p[preceding-sibling::div[div="Heading1"] and count(preceding-sibling::div[div])=1]
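For what it's worth, a small sketch of how this counting approach behaves on a trimmed copy of the question's HTML (variable names are mine, and I've used single quotes around Heading1 so the expression fits in a double-quoted R string). The predicate keeps a p only if it has a preceding sibling div containing a child div equal to 'Heading1', and exactly one preceding sibling div with a div child. Note the assumption baked in: the count is hard-coded to 1, so this only isolates the block under the first heading among the siblings.

```r
library(XML)

# Trimmed copy of the sample HTML from the question
html <- '
<div class="AAA">
<div class="AAA">Heading1</div>
</div>
<p>text1 I want</p>
<p>text2 I want</p>
<p>text3 I want</p>
<div class="AAA">
<div class="AAA">Heading2</div>
</div>
<p>more text</p>
<p>more text</p>
<div class="AAA">
<div class="AAA">Heading3</div>
</div>'

doc <- htmlParse(html, asText = TRUE)

# Paragraphs after Heading2 have two preceding sibling divs with a div
# child, so count(...) = 1 excludes them
xpath_count <- "//p[preceding-sibling::div[div='Heading1'] and count(preceding-sibling::div[div])=1]"
counted <- xpathSApply(doc, xpath_count, xmlValue)
print(counted)  # the three paragraphs under Heading1
```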
Upvotes: -1