Carol.Kar

Reputation: 5355

R WebCrawler - XML content does not seem to be XML:

I took the following code from the rNomads package and modified it a little bit.

When I initially run it, I get:

> WebCrawler(url = "www.bikeforums.net")
[1] "www.bikeforums.net"
[1] "www.bikeforums.net"

Warning message:
XML content does not seem to be XML: 'www.bikeforums.net' 

Here is the code:

require("XML")

# cleaning workspace
rm(list = ls())

# This function recursively searches for links in the given url and follows every single link.
# It returns a list of the final (dead end) URLs.
# depth - How many links to return. This avoids having to recursively scan hundreds of links. Defaults to NULL, which returns everything.
WebCrawler <- function(url, depth = NULL, verbose = TRUE) {

  doc <- XML::htmlParse(url)
  links <- XML::xpathSApply(doc, "//a/@href")
  XML::free(doc)
  if(is.null(links)) {
    if(verbose) {
      print(url)
    }
    return(url)
  } else {
    urls.out <- vector("list", length = length(links))
    for(link in links) {
      if(!is.null(depth)) {
        if(length(unlist(urls.out)) >= depth) {
          break
        }
      }
      urls.out[[link]] <- WebCrawler(link, depth = depth, verbose = verbose)
    }
    return(urls.out)
  }
}


# Execution
WebCrawler(url = "www.bikeforums.net")

Any recommendations on what I am doing wrong?

UPDATE

Hello guys,

I started this bounty because I think the R community needs a function like this that can crawl webpages. The solution that wins the bounty should show a function that takes two parameters:

WebCrawler(url = "www.bikeforums.net", xpath = "//title")

I really appreciate your replies.

Upvotes: 10

Views: 855

Answers (1)

dimitris_ps

Reputation: 5951

Insert the following code under links <- XML::xpathSApply(doc, "//a/@href") in your function.

links <- XML::xpathSApply(doc, "//a/@href")
links1 <- links[grepl("http", links)] # As @Floo0 pointed out, this captures absolute (non-relative) links
links2 <- paste0(url, links[!grepl("http", links)]) # and this captures relative links by prefixing them with the base url
links <- c(links1, links2)

And also remember to pass the url with its scheme, i.e. as http://www......
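
For example, the call would then look something like this (a minimal usage sketch; the depth value is only an illustration):

# A fully qualified URL lets htmlParse fetch the page instead of treating
# the bare string as literal (non-XML) content.
WebCrawler(url = "http://www.bikeforums.net", depth = 10)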

Also, you are not updating your urls.out list as intended: as you have it, the preallocated slots (one per element of links) are never filled.
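
A minimal sketch of one way to fill the list, indexing by position with seq_along instead of by the link string (this would replace the for loop in your function):

urls.out <- vector("list", length = length(links))
for(i in seq_along(links)) {
  if(!is.null(depth)) {
    if(length(unlist(urls.out)) >= depth) {
      break
    }
  }
  # Assign by numeric index so the preallocated slot is actually filled.
  urls.out[[i]] <- WebCrawler(links[i], depth = depth, verbose = verbose)
}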

Upvotes: 2
