John Smith

Reputation: 1698

How to identify a node with its XML value in XPath?

I use R to scrape a web site, and when parsing the HTML code, I have this code below:

    <div class="line">
        <h2 class="clearfix">
            <span class="property">Number<div>number extra</div></span>
            <span class="value">3</span>
        </h2>
    </div>
    <div class="line">
        <h2 class="clearfix">
            <span class="property">Surface</span>
            <span class="value">72</span>
        </h2>
    </div>

Now I would like to get some values in this code.

Here's what I started to do. First, I extract all the "property" spans:

    xpathApply(page, '//span[@class="property"]', xmlValue)

Then I extract all the "value" spans:

    xpathApply(page, '//span[@class="value"]', xmlValue)

Then I build a list or a matrix, so that I can look up the value of "Surface", which is 72. The problem is that sometimes a span with class="property" is not followed by a span with class="value" inside the same h2, so the two lists have different lengths and I cannot pair them up reliably.

Would the most efficient way be to identify the span with class="property", then the h2 that contains this span, and then the span with class="value" inside that h2?

Upvotes: 2

Views: 408

Answers (2)

alistaire

Reputation: 43354

XPath can match on the text content of an element with the text() node test. Using rvest for simplicity:

library(rvest)

html <- '<div class="line">
<h2 class="clearfix">
<span class="property">Number<div>number extra</div></span>
<span class="value">3</span>
</h2>
</div>
<div class="line">
<h2 class="clearfix">
<span class="property">Surface</span>
<span class="value">72</span>
</h2>
</div>' 

html %>% read_html() %>%    # read html
    html_nodes(xpath = '//span[text()="Number"]/*') %>%    # select node
    html_text()    # get text contents of node
# [1] "number extra"

XPath also has axes for moving between related nodes, in this case following::, which selects the nodes that come after the context node in document order:

html %>% read_html() %>%    # read html
    html_nodes(xpath = '//span[text()="Surface"]/following::*') %>%    # select node
    html_text()    # get text contents of node
# [1] "72"
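
Note that following:: matches every later element in the document, so the expression above returns a single value only because Surface is the last property. A minimal sketch of a stricter variant, assuming rvest is installed: it starts from each h2 and uses html_node(), which returns one result per input, so a property with no value span yields NA instead of shifting the list. The third block ("Rooms") is a hypothetical addition that is not in the original HTML; it is there only to show the missing-value case, and the name lookup is likewise just illustrative.

```r
library(rvest)

# Original HTML plus one hypothetical block ("Rooms") that has no value span
html <- '<div class="line">
<h2 class="clearfix">
<span class="property">Number<div>number extra</div></span>
<span class="value">3</span>
</h2>
</div>
<div class="line">
<h2 class="clearfix">
<span class="property">Surface</span>
<span class="value">72</span>
</h2>
</div>
<div class="line">
<h2 class="clearfix">
<span class="property">Rooms</span>
</h2>
</div>'

page <- read_html(html)

# Every h2 that contains a property span
h2s <- html_nodes(page, xpath = '//h2[span[@class="property"]]')

# First text node of the property span, so the nested div text is ignored
props <- trimws(html_text(
  html_node(h2s, xpath = 'span[@class="property"]/text()[1]')
))

# html_node() returns exactly one result per h2; a missing span becomes NA
values <- html_text(html_node(h2s, xpath = 'span[@class="value"]'))

# Named vector: property name -> value
lookup <- setNames(values, props)
lookup[["Surface"]]
```

Because the value is looked up relative to each h2, the property and value vectors always have the same length, which is what the paired lists in the question were missing.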

Upvotes: 1

kjhughes

Reputation: 111686

With your HTML made well-formed by adding a single root element,

<?xml version="1.0" encoding="UTF-8"?>
<r> 
  <div class="line"> 
    <h2 class="clearfix"> 
      <span class="property">Number
        <div>number extra</div>
      </span>  
      <span class="value">3</span> 
    </h2> 
  </div>  
  <div class="line"> 
    <h2 class="clearfix"> 
      <span class="property">Surface</span>  
      <span class="value">72</span> 
    </h2> 
  </div> 
</r>

(A) This XPath expression,

//span[@class='property' and starts-with(., 'Number')]/div/text()

will return

number extra

as requested.


(B) This XPath expression,

//h2[span[@class='property' and . = 'Surface']]/span[@class='value']/text()

will return

72

as requested.
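
In R, expression (B) could be applied with the XML package that the question already uses. This is only a sketch under a few assumptions: the XML package is installed, the document is the well-formed version above (without the XML declaration), and value_of() is a hypothetical helper name introduced here for illustration. It assumes property names contain no single quotes, since the name is pasted straight into the XPath string.

```r
library(XML)

# Well-formed version of the HTML, wrapped in a single root element
xml <- '<r>
  <div class="line">
    <h2 class="clearfix">
      <span class="property">Number
        <div>number extra</div>
      </span>
      <span class="value">3</span>
    </h2>
  </div>
  <div class="line">
    <h2 class="clearfix">
      <span class="property">Surface</span>
      <span class="value">72</span>
    </h2>
  </div>
</r>'

doc <- xmlParse(xml, asText = TRUE)

# Hypothetical helper: value of the property whose span text equals `name`.
# Returns NA when no h2 has both that property and a value span.
value_of <- function(doc, name) {
  path <- sprintf(
    "//h2[span[@class='property' and . = '%s']]/span[@class='value']/text()",
    name
  )
  res <- xpathSApply(doc, path, xmlValue)
  if (length(res) == 0) NA_character_ else res[[1]]
}

value_of(doc, "Surface")
```

The exact comparison (. = 'Surface') does not match the "Number" span here, because its string value also includes the nested div text; expression (A) with starts-with() covers that case.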

Upvotes: 1
