Carol.Kar

Reputation: 5355

R WebCrawler - XML content does not seem to be XML:

I took the following code from the rNomads package and modified it a little bit.

When I initially run it, I get:

> WebCrawler(url = "www.bikeforums.net")
[1] "www.bikeforums.net"
[1] "www.bikeforums.net"

Warning message:
XML content does not seem to be XML: 'www.bikeforums.net' 

Here is the code:

require("XML")

# cleaning workspace
rm(list = ls())

# This function recursively searches for links in the given url and follows every single link.
# It returns a list of the final (dead end) URLs.
# depth - How many links to return. This avoids having to recursively scan hundreds of links. Defaults to NULL, which returns everything.
WebCrawler <- function(url, depth = NULL, verbose = TRUE) {

  doc <- XML::htmlParse(url)
  links <- XML::xpathSApply(doc, "//a/@href")
  XML::free(doc)
  if(is.null(links)) {
    if(verbose) {
      print(url)
    }
    return(url)
  } else {
    urls.out <- vector("list", length = length(links))
    for(link in links) {
      if(!is.null(depth)) {
        if(length(unlist(urls.out)) >= depth) {
          break
        }
      }
      urls.out[[link]] <- WebCrawler(link, depth = depth, verbose = verbose)
    }
    return(urls.out)
  }
}


# Execution
WebCrawler(url = "www.bikeforums.net")

Any recommendations on what I am doing wrong?

UPDATE

Hello guys,

I started this bounty because I think the R community needs a function like this that can crawl webpages. The solution that wins the bounty should show a function that takes two parameters:

WebCrawler(url = "www.bikeforums.net", xpath = "//title")

I really appreciate your replies.

Upvotes: 10

Views: 855

Answers (1)

dimitris_ps

Reputation: 5951

Insert the following code under links <- XML::xpathSApply(doc, "//a/@href") in your function.

links <- XML::xpathSApply(doc, "//a/@href")
links1 <- links[grepl("http", links)] # As @Floo0 pointed out, this captures absolute (non-relative) links
links2 <- paste0(url, links[!grepl("http", links)]) # and this captures relative links by prefixing them with the base url
links <- c(links1, links2)

And also remember to pass the url with its scheme, i.e. as http://www......
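
For example, the call would then look something like this (a minimal usage sketch; the depth value is only an illustration):

# A fully qualified URL lets htmlParse fetch the page instead of treating
# the bare string as literal (non-XML) content.
WebCrawler(url = "http://www.bikeforums.net", depth = 10)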

Also, you are not updating your urls.out list as intended: as you have it, the preallocated slots (one per element of links) are never filled.
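
A minimal sketch of one way to fill the list, indexing by position with seq_along instead of by the link string (this would replace the for loop in your function):

urls.out <- vector("list", length = length(links))
for(i in seq_along(links)) {
  if(!is.null(depth)) {
    if(length(unlist(urls.out)) >= depth) {
      break
    }
  }
  # Assign by numeric index so the preallocated slot is actually filled.
  urls.out[[i]] <- WebCrawler(links[i], depth = depth, verbose = verbose)
}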

Upvotes: 2
