Ben

Reputation: 21625

Trouble scraping table from Wikipedia

I'm having trouble following the selected answer to this question. The table I'm trying to scrape is this list of U.S. state populations.

library(XML)
theurl <- "http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"
tables <- readHTMLTable(theurl)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))

This is the error I'm getting:

Error: failed to load external entity "http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"

What gives?

(Note - although I'm looking to resolve this error, if you can point me to an easier way of getting population data I'd appreciate it.)

Upvotes: 3

Views: 132

Answers (2)

C8H10N4O2

Reputation: 19005

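This is pretty easy to do with rvest:

library(rvest)     # read_html(), html_nodes(), html_table()
library(magrittr)  # for %>% and extract()

# theurl is the URL string defined in the question
theurl %>%
  read_html() %>%                           # replaces the deprecated html()
  html_nodes("table") %>% extract(1) %>%    # keep only the first <table> node
  html_table(fill=TRUE) %>% extract(1) -> pop_table  # one-element list holding the data frame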

See @Cory's blog for more info.
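
If the table you want is not the first one on the page, the row-count idea from the question (`n.rows`) carries over to rvest. Below is a minimal sketch, assuming the population table is simply the table with the most rows. The inline HTML is made-up stand-in markup so the example runs without a network connection; against the real page you would pass `theurl` to `read_html()` instead.

```r
library(rvest)

# Stand-in for the downloaded page: two tables of different sizes.
# (Hypothetical markup, not the real Wikipedia page.)
page <- read_html("
  <table>
    <tr><th>Note</th></tr>
    <tr><td>legend</td></tr>
  </table>
  <table>
    <tr><th>State</th><th>Population</th></tr>
    <tr><td>A</td><td>100</td></tr>
    <tr><td>B</td><td>200</td></tr>
    <tr><td>C</td><td>300</td></tr>
  </table>")

# Convert every <table> node to a data frame
tables <- html_table(html_nodes(page, "table"))

# Count rows per table and keep the largest one
n.rows <- vapply(tables, nrow, integer(1))
pop_table <- tables[[which.max(n.rows)]]
```

Picking by row count is only a heuristic; if another table on the page happens to be longer, you would need to select by position or by a CSS selector instead.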

Upvotes: 1

Shawn Mehan

Reputation: 4568

There is nothing wrong with your code. There is, however, something wrong with your URL.

You can test this from a shell by checking whether the external inputs to your code, rather than the code itself, are causing it to fail, e.g.,

curl https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population

which will return an empty body, similar to your R code. This should lead you to believe that it isn't your R code that is faulty. Having found this, you might move on to the section of the page you are interested in, again using curl as a free and easy test environment, and run

curl https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population#States_and_territories

which will most definitely not return an empty result:

...
<body class="mediawiki ltr sitedir-ltr ns-0 ns-subject page-List_of_U_S_states_and_territories_by_population skin-vector action-view">
    <div id="mw-page-base" class="noprint"></div>
    <div id="mw-head-base" class="noprint"></div>
    <div id="content" class="mw-body" role="main">
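
The same separation of "fetch" from "parse" can be kept inside R if you would rather stay with the XML package. This is only a sketch, under the assumption that the failure happens when `readHTMLTable()` tries to fetch the URL itself: the page is downloaded by some other tool and the resulting text is handed to the parser. `parse_tables()` is a hypothetical helper name, and the sample markup is made up so the example runs offline.

```r
library(XML)

# Parse every <table> out of HTML that has already been downloaded as text.
# parse_tables() is a hypothetical helper, not part of the XML package.
parse_tables <- function(html_text) {
  doc <- htmlParse(html_text, asText = TRUE)
  readHTMLTable(doc, stringsAsFactors = FALSE)
}

# Offline check with made-up markup standing in for the real page
sample_html <- "<html><body>
  <table>
    <tr><th>State</th><th>Population</th></tr>
    <tr><td>A</td><td>100</td></tr>
    <tr><td>B</td><td>200</td></tr>
  </table>
</body></html>"

tables <- parse_tables(sample_html)

# Same row count computation as in the question
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
```

For the live page, the text could come from another package, for example `httr::content(httr::GET(theurl), as = "text")`, assuming httr is installed; the result is then passed to `parse_tables()`.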

Upvotes: 2
