Reputation: 30301
I'm trying to write a function in R that, given a URL, will return a list of the links on that webpage.
For example:
getLinks("http://prog21.dadgum.com/109.html")
Would return:
"http://prog21.dadgum.com/prog21.css"
"http://prog21.dadgum.com/atom.xml"
"http://prog21.dadgum.com/index.html"
"http://prog21.dadgum.com/archives.html"
"http://prog21.dadgum.com/atom.xml"
"http://prog21.dadgum.com/56.html"
"http://prog21.dadgum.com/39.html"
"http://prog21.dadgum.com/109.html"
"http://prog21.dadgum.com/108.html"
"http://prog21.dadgum.com/107.html"
"http://prog21.dadgum.com/106.html"
"http://prog21.dadgum.com/105.html"
"http://prog21.dadgum.com/104.html"
Upvotes: 2
Views: 309
Reputation: 30301
The function below seems to work on other webpages, but for some reason it does not return complete URLs for the page in question. I'm interested to see if there's a better way to do this.
getLinks <- function(URL) {
    require(XML)
    # Parse the page and pull the value of every href attribute on it
    doc <- htmlParse(URL)
    out <- unlist(doc["//@href"])
    # Drop the attribute names so a plain character vector is returned
    names(out) <- NULL
    out
}
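If the page uses relative links (prog21.css, atom.xml, and so on), htmlParse() hands them back exactly as they appear in the HTML, which would explain the incomplete URLs. A minimal sketch of one workaround, assuming getRelativeURL() from the same XML package passes already-absolute URLs through unchanged (the name getLinksAbsolute is just for illustration):

getLinksAbsolute <- function(URL) {
    require(XML)
    doc <- htmlParse(URL)
    # Every href attribute value, whether it comes from <a>, <link>, etc.
    hrefs <- xpathSApply(doc, "//@href")
    # Resolve relative hrefs like "prog21.css" against the page URL;
    # getRelativeURL() should leave absolute URLs as they are
    out <- sapply(hrefs, getRelativeURL, baseURL = URL)
    names(out) <- NULL
    out
}

getLinksAbsolute("http://prog21.dadgum.com/109.html") should then give the fully qualified list above. If you'd rather use the xml2/rvest stack, url_absolute() in xml2 does the same resolution.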
Upvotes: 3