Reputation: 1698
I use R to scrape a website, and the HTML I am parsing looks like this:
<div class="line">
<h2 class="clearfix">
<span class="property">Number<div>number extra</div></span>
<span class="value">3</span>
</h2>
</div>
<div class="line">
<h2 class="clearfix">
<span class="property">Surface</span>
<span class="value">72</span>
</h2>
</div>
Now I would like to get some values in this code.
How can I identify the span whose XML value is "Number" and get that node, in order to extract "number extra"?
I know how to use xpathApply to select nodes and then get their xmlValue or an attribute (such as href, with xmlGetAttr), but I don't know how to select a node by its xmlValue.
xpathApply(page, '//span[@class="property"]',xmlValue)
If I want to get the value 72 for the property "Surface", what is the most efficient way?
Here is what I started to do. First, I extract all "property" spans:
xpathApply(page, '//span[@class="property"]',xmlValue)
Then I extract all "value" spans:
xpathApply(page, '//span[@class="value"]',xmlValue)
Then I build a list or a matrix so that I can look up the value of "Surface", which is 72. The problem is that sometimes a span with class="property" is not followed by a span with class="value" in the same h2, so the two results do not line up and I cannot build a proper list.
Could this be the most efficient way: identify the span with class="property", then identify the h2 that contains this span, then identify the span with class="value"?
Upvotes: 2
Views: 408
Reputation: 43354
XPath can test the contents of a tag using its own function text(). Using rvest for simplicity:
library(rvest)
html <- '<div class="line">
<h2 class="clearfix">
<span class="property">Number<div>number extra</div></span>
<span class="value">3</span>
</h2>
</div>
<div class="line">
<h2 class="clearfix">
<span class="property">Surface</span>
<span class="value">72</span>
</h2>
</div>'
html %>% read_html() %>% # read html
html_nodes(xpath = '//span[text()="Number"]/*') %>% # select node
html_text() # get text contents of node
# [1] "number extra"
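XPath also has axes for moving between related nodes, in this case following::, which selects every element after the matched node in document order. Here that is only the value span, because "Surface" is the last property; following-sibling::span[@class="value"] is a narrower choice when more markup follows: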
html %>% read_html() %>% # read html
html_nodes(xpath = '//span[text()="Surface"]/following::*') %>% # select node
html_text() # get text contents of node
# [1] "72"
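To address the alignment problem from the question (a property span that has no value span), one option is to work per h2 rather than extracting two independent lists. The following is only a sketch under assumptions: the "Rooms" line is a hypothetical addition to the sample markup to show a missing value, and the names lines, props, vals and tbl are illustrative.

```r
library(rvest)

# Same markup as above, plus one hypothetical line ("Rooms") with no value span
html <- '<div class="line">
<h2 class="clearfix">
<span class="property">Number<div>number extra</div></span>
<span class="value">3</span>
</h2>
</div>
<div class="line">
<h2 class="clearfix">
<span class="property">Rooms</span>
</h2>
</div>
<div class="line">
<h2 class="clearfix">
<span class="property">Surface</span>
<span class="value">72</span>
</h2>
</div>'

# One node per h2 that contains a property span
lines <- html %>% read_html() %>%
  html_nodes(xpath = '//h2[span[@class="property"]]')

# html_node() returns exactly one result per h2 (a missing node when nothing
# matches), so both vectors keep the same length and order
props <- lines %>%
  html_node(xpath = './span[@class="property"]/text()') %>% # own text only, not the nested div
  html_text(trim = TRUE)
vals <- lines %>%
  html_node(xpath = './span[@class="value"]') %>%
  html_text(trim = TRUE)

tbl <- data.frame(property = props, value = vals, stringsAsFactors = FALSE)

# Look up a value by property name
tbl$value[tbl$property == "Surface"]
```

Because a missing node yields NA from html_text(), a property without a value stays in the table as NA instead of shifting the later values up by one.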
Upvotes: 1
Reputation: 111686
With your HTML made well-formed by adding a single root element,
<?xml version="1.0" encoding="UTF-8"?>
<r>
<div class="line">
<h2 class="clearfix">
<span class="property">Number
<div>number extra</div>
</span>
<span class="value">3</span>
</h2>
</div>
<div class="line">
<h2 class="clearfix">
<span class="property">Surface</span>
<span class="value">72</span>
</h2>
</div>
</r>
(A) This XPath expression,
//span[@class='property' and starts-with(., 'Number')]/div/text()
will return
number extra
as requested.
(B) This XPath expression,
//h2[span[@class='property' and . = 'Surface']]/span[@class='value']/text()
will return
72
as requested.
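As a rough sketch of how these two expressions could be run with the XML package the question already uses: this assumes the original HTML fragment is held in a string and parsed with htmlParse(), and the variable names html, page, extra and surface are illustrative only.

```r
library(XML)

# The fragment from the question, parsed leniently as HTML (no root element needed)
html <- '<div class="line">
<h2 class="clearfix">
<span class="property">Number<div>number extra</div></span>
<span class="value">3</span>
</h2>
</div>
<div class="line">
<h2 class="clearfix">
<span class="property">Surface</span>
<span class="value">72</span>
</h2>
</div>'
page <- htmlParse(html, asText = TRUE)

# (A) text of the div nested inside the "Number" property span
extra <- xpathSApply(
  page,
  "//span[@class='property' and starts-with(., 'Number')]/div/text()",
  xmlValue
)

# (B) value span that shares an h2 with the "Surface" property span
surface <- xpathSApply(
  page,
  "//h2[span[@class='property' and . = 'Surface']]/span[@class='value']/text()",
  xmlValue
)

extra
surface
```

Expression (B) anchors on the h2, so it only returns a value span that sits in the same h2 as the matching property; when that span is absent, nothing is matched rather than a neighbouring line's value being picked up.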
Upvotes: 1