How to identify a node with its XML value in XPath?

Question

I am using R to scrape a website. After parsing the page, the HTML looks like this:

    
        
            Numbernumber extra
            3
        
    
    
        
            Surface
            72

Now I would like to get some values in this code.

How can I identify the span with the XML value "Number" and get that node, in order to extract "number extra"?
I know how to use xpathApply to identify nodes in order to get the xmlValue or some attribute (such as href with xmlGetAttr). But I don't know how to identify a node by its xmlValue.
```
  xpathApply(page, '//span[@class="property"]',xmlValue)
```
If I want to get the "value" 72 for the property "Surface", what is the most efficient way?

Here is what I have started to do. First, I extract all "property" spans:

    xpathApply(page, '//span[@class="property"]', xmlValue)

Then I extract all "value" spans:

    xpathApply(page, '//span[@class="value"]', xmlValue)

Then I build a list or a matrix, so that I can look up the value of "Surface", which is 72. The problem is that a span with class="property" is sometimes not followed by a span with class="value" in the same h2. The two results then have different lengths and no longer line up, so I cannot build a proper list.

Could this be the most efficient way? Identify the span with class="property", then identify the h2 that contains this span, then identify the span with class="value"?

kjhughes · Accepted Answer

For your HTML made to be well-formed by adding a single root element,


 
   
     
      Number
        number extra
        
      3 
     
    
   
     
      Surface  
      72

(A) This XPath expression,

//span[@class='property' and starts-with(., 'Number')]/div/text()

will return

number extra

as requested.
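
For readers who want to try expression (A) from R, here is a minimal sketch using the `XML` package the question already relies on. The markup is an assumption: the `h2`, `span`, and `div` names come from the XPath expressions, while the `root` element and the outer `div` wrappers are placeholders for whatever the real page uses.

```r
library(XML)

# Assumed markup: <root> and the outer <div> wrappers are hypothetical names.
xml <- '<root>
  <div>
    <h2>
      <span class="property">Number<div>number extra</div></span>
      <span class="value">3</span>
    </h2>
  </div>
  <div>
    <h2>
      <span class="property">Surface</span>
      <span class="value">72</span>
    </h2>
  </div>
</root>'
doc <- xmlParse(xml, asText = TRUE)

# Select the property span whose string value starts with "Number",
# then take the text of its child div.
extra <- xpathSApply(
  doc,
  "//span[@class='property' and starts-with(., 'Number')]/div/text()",
  xmlValue
)
print(extra)
```

`starts-with` is used rather than `. = 'Number'` because the string value of that span also includes the text of its nested `div`.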

(B) This XPath expression,

//h2[span[@class='property' and . = 'Surface']]/span[@class='value']/text()

will return

72

as requested.
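
Expression (B) can be wrapped in a small lookup helper, which also covers the case from the question where a property has no value span in its `h2`. This is only a sketch under assumptions: `value_of` is a hypothetical helper name, the `root` and outer `div` wrappers are placeholders, and the "Garden" entry is invented here to show the missing-value case.

```r
library(XML)

# Assumed markup: <root>, the outer <div> wrappers, and the "Garden"
# entry (a property without a value) are hypothetical.
xml <- '<root>
  <div>
    <h2>
      <span class="property">Surface</span>
      <span class="value">72</span>
    </h2>
  </div>
  <div>
    <h2>
      <span class="property">Garden</span>
    </h2>
  </div>
</root>'
doc <- xmlParse(xml, asText = TRUE)

# Hypothetical helper: find the h2 that contains the given property span
# and return the text of the value span in that same h2.
# Note: a property name containing a single quote would break the XPath string.
value_of <- function(doc, property) {
  xp <- sprintf(
    "//h2[span[@class='property' and . = '%s']]/span[@class='value']/text()",
    property
  )
  res <- xpathSApply(doc, xp, xmlValue)
  # No matching value span: return NA instead of an empty result.
  if (length(res) == 0) NA_character_ else res[[1]]
}

surface <- value_of(doc, "Surface")
garden  <- value_of(doc, "Garden")
print(surface)
print(garden)
```

Because each lookup is anchored on the `h2` that holds the property, a missing value span yields `NA` for that property only and does not shift the values of the others.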
