Reputation: 99331
I'm new to HTML and XML processing, and have a question about extracting certain parts of a parsed HTML document. The following document, doc, was produced with htmlTreeParse in package XML, and can be reproduced with the code below.
What happens to the four-digit number at the end of each line when href="..." is accessed? I need to extract those numbers, but they seem to disappear.
library(XML)
doc <- htmlTreeParse("http://www.retrosheet.org/gamelogs/index",
useInternalNodes = TRUE)
doc["//a/@href"][100:101]
# [[1]]
# href
# "http://www.retrosheet.org/gamelogs/gl1924.zip"
# attr(,"class")
# [1] "XMLAttributeValue"
#
# [[2]]
# href
# "http://www.retrosheet.org/gamelogs/gl1925.zip"
# attr(,"class")
# [1] "XMLAttributeValue"
So basically, from each of the <a> elements shown below, I want to extract the four-digit number that follows the href attribute. The result should be the vector
[1] 1871 1872 ... ... 2012 2013
Here's a peek at the HTML document:
...
<br/>
</b>
<pre>
<a href="http://www.retrosheet.org/gamelogs/gl1871.zip">1871</a>
<a href="http://www.retrosheet.org/gamelogs/gl1872.zip">1872</a>
... ....
... ....
<a href="http://www.retrosheet.org/gamelogs/gl2012.zip">2012</a>
<a href="http://www.retrosheet.org/gamelogs/gl2013.zip">2013</a>
</pre>
<a href="http://www.retrosheet.org/gamelogs/glws.zip">World Series</a>
<br/>
<a href="http://www.retrosheet.org/gamelogs/glas.zip">All-Star</a>
<br/>
Upvotes: 1
Views: 95
Reputation: 54237
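The four-digit numbers are the text content of each <a> node, not part of its href attribute, so "//a/@href" never returns them. If you want the value instead of the href attribute, try one of the following: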
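# Text nodes themselves, returned as a list of XML nodes
doc["//a/text()"][100:101]
# Text content of each <a> node as a character vector
sapply(doc["//a"][100:101], xmlValue)
# Same, with leading and trailing whitespace stripped
sapply(doc["//a"][100:101], xmlValue, trim = TRUE)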
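To get the whole vector of years rather than two elements, something along these lines might work. This is only a sketch: it parses a small inline copy of the excerpt from the question instead of the live page, and it assumes the year links are exactly the <a> nodes inside <pre>, as the excerpt suggests. The names html, years, hrefs and years_from_href are made up for illustration.

```r
library(XML)

# Inline stand-in for the page (assumption: same structure as the
# excerpt in the question), so the example runs without network access.
html <- '<html><body>
<pre>
<a href="http://www.retrosheet.org/gamelogs/gl1871.zip">1871</a>
<a href="http://www.retrosheet.org/gamelogs/gl1872.zip">1872</a>
<a href="http://www.retrosheet.org/gamelogs/gl2013.zip">2013</a>
</pre>
<a href="http://www.retrosheet.org/gamelogs/glws.zip">World Series</a>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# Restrict the XPath to anchors inside <pre> so that links such as
# "World Series" are skipped, then convert the link text to integers.
years <- as.integer(xpathSApply(doc, "//pre/a", xmlValue))

# Alternative: recover the same digits from the href values themselves.
hrefs <- xpathSApply(doc, "//pre/a/@href")
years_from_href <- as.integer(sub(".*gl([0-9]{4})\\.zip$", "\\1", hrefs))
```

On the real document the same two xpathSApply calls should apply to the doc object from the question, provided the page still has that layout.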
Upvotes: 1