Reputation: 21625
I'm having trouble following the selected answer to this question. The table I'm trying to scrape is this list of U.S. state populations.
library(XML)
theurl <- "http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"
tables <- readHTMLTable(theurl)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
This is the error I'm getting:
Error: failed to load external entity "http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"
What gives?
(Note: although I'm looking to resolve this error, if you can point me to an easier way of getting population data, I'd appreciate it.)
Upvotes: 3
Views: 132
Reputation: 19005
This is pretty easy to do in rvest:
library(rvest); library(magrittr)  # magrittr supplies %>% and extract()

theurl %>%
  html() %>%                                 # parse the page
  html_nodes("table") %>% extract(1) %>%     # keep the first table node
  html_table(fill = TRUE) %>% extract(1) -> pop_table
See @Cory's blog for more info.
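If you're on a newer rvest, note that html() has since been deprecated in favor of read_html(), and the trailing extract(1) leaves pop_table as a one-element list rather than a data frame. A minimal updated sketch, assuming a recent rvest and the HTTPS version of the URL:

library(rvest)  # rvest re-exports %>%

theurl <- "https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"
pop_table <- read_html(theurl) %>%
  html_node("table") %>%    # first matching table node
  html_table(fill = TRUE)   # returns a data frame, not a list

head(pop_table)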
Upvotes: 1
Reputation: 4568
There is nothing wrong with your code. There is, however, something wrong with your URL.
You can test this from a shell, to verify whether the external input to your code (the URL) is what is actually failing, e.g.,
curl http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population
which returns an empty body (the server answers the plain-HTTP request with a redirect to HTTPS), just as your R code saw. This should lead you to believe that it isn't your R code that is faulty. Having made this discovery, you can try the HTTPS URL for the section of the page you are interested in, again using curl as a free and easy test environment, and run
curl https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population#States_and_territories
which will most definitely not return an empty result:
...
<body class="mediawiki ltr sitedir-ltr ns-0 ns-subject page-List_of_U_S_states_and_territories_by_population skin-vector action-view">
<div id="mw-page-base" class="noprint"></div>
<div id="mw-head-base" class="noprint"></div>
<div id="content" class="mw-body" role="main">
Upvotes: 2