ZLevine

Reputation: 322

lapply, difference between c() and list()

What is the difference between c() and list()? I'm learning some web scraping and ran into an unexpected error. I wrote a small script to scrape baseball data from a few pages on ESPN's website:

library(magrittr)
library(rvest)

Baseball <- read_html("http://www.espn.com/mlb/stats/batting/_/qualified/true")
Baseball.2 <- read_html("http://www.espn.com/mlb/stats/batting/_/count/41/qualified/true")
Baseball.3 <- read_html("http://www.espn.com/mlb/stats/batting/_/count/81/qualified/true")
Baseball.4 <- read_html("http://www.espn.com/mlb/stats/batting/_/count/121/qualified/true")
Baseball.list <- c(Baseball, Baseball.2, Baseball.3, Baseball.4)


scrape <- function(html) {
  # Placeholder column sized to the number of rows found on the page
  temp.df <- data.frame(1:length(html %>%
                                   html_nodes("td:nth-child(2)") %>%
                                   html_text()))
  # Table columns 2 to 19 fill data frame columns 1 to 18
  for (i in 2:19) {
    temp.df[i - 1] <-
      html %>%
      html_nodes(paste0("td:nth-child(", i, ")")) %>%
      html_text()
  }
  temp.df
}

When I run df <- lapply(Baseball.list, scrape) I get:

Error in UseMethod("xml_find_all") : 
  no applicable method for 'xml_find_all' applied to an object of class "externalptr" 

But if I run Baseball.list <- list(Baseball, Baseball.2, Baseball.3, Baseball.4) and then use lapply and my function in exactly the same way, it works without a problem! I checked the documentation for c() and see that: "This is a generic function which combines its arguments. The default method combines its arguments to form a vector. All arguments are coerced to a common type which is the type of the returned value, and all attributes except names are removed," whereas the documentation for list() says it coerces objects into a list. Can someone explain why using c() in this instance causes lapply to fail? I don't understand the documentation.

Upvotes: 4

Views: 4668

Answers (1)

Jean

Reputation: 1490

Exactly as the documentation for c() says,

"All arguments are coerced to a common type which is the type of the returned value, and all attributes except names are removed"

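list() keeps each document intact as one element, with the classes assigned by xml2::read_html. c() does not: each document is itself a list with the elements node and doc, so c() concatenates those elements into one flat list and drops the class attribute, leaving bare external pointers. If you look at the source code for xml2, you'll see that the generic xml_find_all only has methods for the classes xml_missing, xml_node and xml_nodeset, so calling it on an "externalptr" fails.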

> class(read_html("<html><title>Hi</title></html>"))
[1] "xml_document" "xml_node"
> a = read_html("<html><title>Hi</title></html>")
> b = read_html("<html><title>Hi</title></html>")
> c = read_html("<html><title>Hi</title></html>")
> lapply(c(a,b,c), class)
$node
[1] "externalptr"

$doc
[1] "externalptr"

$node
[1] "externalptr"

$doc
[1] "externalptr"

$node
[1] "externalptr"

$doc
[1] "externalptr"

> lapply(list(a,b,c), class)
[[1]]
[1] "xml_document" "xml_node"    

[[2]]
[1] "xml_document" "xml_node"    

[[3]]
[1] "xml_document" "xml_node"
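
The same behaviour can be reproduced without xml2 or a network connection. The following is only a minimal sketch in base R: make_doc, describe and the "toy_document"/"toy_node" classes are hypothetical stand-ins invented for illustration, not part of xml2. It assumes, as the output above suggests, that a document is a classed list with node and doc elements.

```r
# Hypothetical stand-in for an xml2 document: a classed list with two fields.
make_doc <- function(id) {
  structure(
    list(node = paste0("node-", id), doc = paste0("doc-", id)),
    class = c("toy_document", "toy_node")
  )
}

# Hypothetical generic with a method only for "toy_node", mirroring how
# xml_find_all only has methods for a few specific classes.
describe <- function(x) UseMethod("describe")
describe.toy_node <- function(x) paste("document with", x$node)

a <- make_doc(1)
b <- make_doc(2)

with_c    <- c(a, b)     # concatenates the fields and drops the class attribute
with_list <- list(a, b)  # keeps each object intact as a single element

length(with_c)           # 4
names(with_c)            # "node" "doc" "node" "doc"
length(with_list)        # 2
class(with_list[[1]])    # "toy_document" "toy_node"

# Dispatch works on the intact objects ...
ok <- lapply(with_list, describe)

# ... but fails on the flattened fields, which are plain character strings here.
failed <- tryCatch(
  lapply(with_c, describe),
  error = function(e) conditionMessage(e)
)
```

In the question's script, the equivalent fix is to build Baseball.list with list() instead of c(), so that lapply receives one whole document per element.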

Upvotes: 2
