Rich Scriven

Reputation: 99331

Extracting information from an HTML parsed document

I'm new to HTML and XML processing and have a question about extracting certain parts of a parsed HTML document. The document doc shown below was produced with htmlTreeParse from the XML package and can be reproduced with the code that follows.

What happens to the four-digit number at the end of each line when href="..." is accessed? I need to extract those numbers, but they seem to disappear.

library(XML)
doc <- htmlTreeParse("http://www.retrosheet.org/gamelogs/index",
                     useInternalNodes = TRUE)
doc["//a/@href"][100:101]
# [[1]]
#                                            href 
# "http://www.retrosheet.org/gamelogs/gl1924.zip" 
# attr(,"class")
# [1] "XMLAttributeValue"
# 
# [[2]]
#                                            href 
# "http://www.retrosheet.org/gamelogs/gl1925.zip" 
# attr(,"class")
# [1] "XMLAttributeValue"

So basically, I want to extract the four-digit number at the end of each line of the HTML shown below. The result should be the vector

[1] 1871 1872 ... ... 2012 2013

Here's a peek at the HTML document:

...
  <br/>
    </b>
    <pre>
    <a href="http://www.retrosheet.org/gamelogs/gl1871.zip">1871</a>
    <a href="http://www.retrosheet.org/gamelogs/gl1872.zip">1872</a>
    ...                                                     ....
    ...                                                     ....
    <a href="http://www.retrosheet.org/gamelogs/gl2012.zip">2012</a>
    <a href="http://www.retrosheet.org/gamelogs/gl2013.zip">2013</a>
    </pre>
    <a href="http://www.retrosheet.org/gamelogs/glws.zip">World Series</a>
    <br/>
    <a href="http://www.retrosheet.org/gamelogs/glas.zip">All-Star</a>
    <br/>

Upvotes: 1

Views: 95

Answers (1)

lukeA

Reputation: 54237

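The four-digit number is the text content of each <a> element, not part of its href attribute, so it does not appear when you select @href. If you want that value instead of the href attribute, try one of the following: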

doc["//a/text()"][100:101] 
sapply(doc["//a"][100:101], xmlValue) 
sapply(doc["//a"][100:101], xmlValue, trim = TRUE) 
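To get the integer vector asked for in the question, something along these lines might work. This is only a sketch: it parses a small inline copy of the HTML excerpt (the `html` string is a stand-in for the real page, so no network access is needed), and it assumes the year links are the only `<a>` elements inside `<pre>`, as in the excerpt. It restricts the XPath to `//pre/a` so that "World Series" and "All-Star" are skipped, and also shows pulling the digits out of the href with a regular expression as an alternative.

```r
library(XML)

# Stand-in for the real page: a shortened copy of the excerpt in the question.
html <- '<html><body>
<pre>
<a href="http://www.retrosheet.org/gamelogs/gl1871.zip">1871</a>
<a href="http://www.retrosheet.org/gamelogs/gl1872.zip">1872</a>
<a href="http://www.retrosheet.org/gamelogs/gl2013.zip">2013</a>
</pre>
<a href="http://www.retrosheet.org/gamelogs/glws.zip">World Series</a>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# Option 1: take the link text of the <a> nodes inside <pre> only.
years <- as.integer(xpathSApply(doc, "//pre/a", xmlValue))

# Option 2: take the href attributes and keep the four digits before ".zip".
hrefs <- unname(xpathSApply(doc, "//pre/a/@href"))
years_from_href <- as.integer(sub(".*gl([0-9]{4})\\.zip$", "\\1", hrefs))

print(years)
print(years_from_href)
```

On the real document, the same two XPath expressions should apply to the `doc` object from the question, provided the page structure matches the excerpt.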

Upvotes: 1
