Reputation: 5355
I took the following code from the rNomads package and modified it a bit.
When I first run it, I get:
> WebCrawler(url = "www.bikeforums.net")
[1] "www.bikeforums.net"
[1] "www.bikeforums.net"
Warning message:
XML content does not seem to be XML: 'www.bikeforums.net'
Here is the code:
require("XML")
# cleaning workspace
rm(list = ls())
# This function recursively searches for links in the given url and follows every single link.
# It returns a list of the final (dead end) URLs.
# depth - How many links to return. This avoids having to recursively scan hundreds of links. Defaults to NULL, which returns everything.
WebCrawler <- function(url, depth = NULL, verbose = TRUE) {
    doc <- XML::htmlParse(url)
    links <- XML::xpathSApply(doc, "//a/@href")
    XML::free(doc)
    if (is.null(links)) {
        if (verbose) {
            print(url)
        }
        return(url)
    } else {
        urls.out <- vector("list", length = length(links))
        for (link in links) {
            if (!is.null(depth)) {
                if (length(unlist(urls.out)) >= depth) {
                    break
                }
            }
            urls.out[[link]] <- WebCrawler(link, depth = depth, verbose = verbose)
        }
        return(urls.out)
    }
}
# Execution
WebCrawler(url = "www.bikeforums.net")
Any recommendations on what I am doing wrong?
UPDATE
Hello guys,
I started this bounty because I think the R community needs a function like this that can crawl webpages. To win the bounty, the solution should show a function that takes two parameters:
WebCrawler(url = "www.bikeforums.net", xpath = "//title")
I really appreciate your replies.
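For illustration, here is a minimal sketch of what such an interface could look like, using the XML package. The helper names (GetXPathText, CrawlSketch) are made up for this sketch, it only crawls one level of links, and relative links are assumed to start with "/":

```r
require("XML")

# Extract the text of the first node matching `xpath` from an
# already-parsed HTML document; NA if nothing matches.
GetXPathText <- function(doc, xpath) {
    out <- XML::xpathSApply(doc, xpath, XML::xmlValue)
    if (length(out) == 0) NA_character_ else out[[1]]
}

# Crawl one level of links from `url` and return the matched
# xpath text for each page reached (illustrative sketch only).
CrawlSketch <- function(url, xpath = "//title", depth = 5) {
    doc <- XML::htmlParse(url)
    links <- XML::xpathSApply(doc, "//a/@href")
    XML::free(doc)
    # keep absolute links, resolve relative ones against the start url
    links <- c(links[grepl("^http", links)],
               paste0(url, links[!grepl("^http", links)]))
    links <- head(links, depth)
    sapply(links, function(l) {
        d <- tryCatch(XML::htmlParse(l), error = function(e) NULL)
        if (is.null(d)) return(NA_character_)
        res <- GetXPathText(d, xpath)
        XML::free(d)
        res
    })
}
```

A call like `CrawlSketch("http://www.bikeforums.net", xpath = "//title")` would then return the page titles of the first few linked pages (network access required).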
Upvotes: 10
Views: 855
Reputation: 5951
Insert the following code after the line links <- XML::xpathSApply(doc, "//a/@href")
in your function:
links <- XML::xpathSApply(doc, "//a/@href")
links1 <- links[grepl("http", links)] # As @Floo0 pointed out, this captures absolute (non-relative) links
links2 <- paste0(url, links[!grepl("http", links)]) # and this captures relative links
links <- c(links1, links2)
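To see what this splitting does, here is a toy run with made-up link values (note that paste0 assumes the relative links start with "/"):

```r
# Toy demonstration of the absolute/relative link split (made-up links).
url   <- "http://www.bikeforums.net"
links <- c("http://example.com/a", "/forum/b.html", "/about.html")

links1 <- links[grepl("http", links)]                # absolute links, kept as-is
links2 <- paste0(url, links[!grepl("http", links)])  # relative links, prefixed with url
links  <- c(links1, links2)

cat(links, sep = "\n")
# http://example.com/a
# http://www.bikeforums.net/forum/b.html
# http://www.bikeforums.net/about.html
```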
Also remember to write the url with the scheme, i.e. as http://www......
You are also not updating your urls.out
list. As you have it, it will always be an empty list with the same length as links.
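The pitfall is easy to see in isolation: indexing a pre-allocated list by the link string appends new named entries instead of filling the existing slots. A toy demonstration (values are made up):

```r
links <- c("a", "b")
urls.out <- vector("list", length = length(links))

# Indexing by the link string APPENDS named elements;
# the two pre-allocated slots stay NULL:
for (link in links) urls.out[[link]] <- toupper(link)
length(urls.out)        # 4, not 2
is.null(urls.out[[1]])  # TRUE: the original slot was never filled

# Indexing by position fills the slots as intended:
urls.out <- vector("list", length = length(links))
for (i in seq_along(links)) urls.out[[i]] <- toupper(links[i])
length(urls.out)        # 2
```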
Upvotes: 2