Nancy

Reputation: 4099

Find html table name and scrape in R

I'm trying to scrape a table from a web page that has multiple tables. I'd like to get the "FIPS Codes for the States and the District of Columbia" table from https://www.census.gov/geo/reference/ansi_statetables.html. I think XML::readHTMLTable() is the right way to go, but when I try the following I get an empty list and a warning:

library(XML)
url <- "https://www.census.gov/geo/reference/ansi_statetables.html"
readHTMLTable(url, header = TRUE, stringsAsFactors = FALSE)

named list()
Warning message:
XML content does not seem to be XML: 'https://www.census.gov/geo/reference/ansi_statetables.html'

This is not surprising, of course, because I'm not giving the function any indication of which table I'd like to read. I've dug around in "Inspect" for quite a while, but I'm not connecting the dots on how to be more precise. The table doesn't seem to have a name or class like the ones in other examples I've found in the documentation or on SO. Thoughts?

Upvotes: 2

Views: 757

Answers (2)

Parfait

Reputation: 107687

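The warning most likely comes from the XML package being unable to fetch https URLs itself, not from a missing table name. Consider using readLines() to download the HTML page content first, then pass the result to readHTMLTable():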

library(XML)

url <- "https://www.census.gov/geo/reference/ansi_statetables.html"
webpage <- readLines(url)  # download the raw HTML as a character vector

readHTMLTable(webpage, header = TRUE, stringsAsFactors = FALSE)        # LIST OF 3 TABLES

# $`NULL`
#                    Name FIPS State Numeric Code Official USPS Code
# 1               Alabama                      01                 AL
# 2                Alaska                      02                 AK
# 3               Arizona                      04                 AZ
# 4              Arkansas                      05                 AR
# 5            California                      06                 CA
# 6              Colorado                      08                 CO
# 7           Connecticut                      09                 CT
# 8              Delaware                      10                 DE
# 9  District of Columbia                      11                 DC
# 10              Florida                      12                 FL
# 11              Georgia                      13                 GA
# 12               Hawaii                      15                 HI
# 13                Idaho                      16                 ID
# 14             Illinois                      17                 IL
# ...

To return only that data frame, index the first element of the list:

fipsdf <- readHTMLTable(webpage, header = TRUE, stringsAsFactors = FALSE)[[1]]

Upvotes: 3

Rentrop

Reputation: 21507

Another solution using rvest instead of XML is:

library(rvest)
read_html("https://www.census.gov/geo/reference/ansi_statetables.html") %>%
  html_table() %>% .[[1]]
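If the tables have no name or class to select on, one option is to read all of them and pick the one whose column headers match. The following is a minimal sketch, not a tested recipe for the live page: html_text is a hypothetical stand-in for the downloaded HTML (its column names are copied from the output shown in the other answer), and the convert argument assumes rvest 1.0 or later.

```r
library(rvest)

# Hypothetical stand-in for the real page: two tables with no id or class.
html_text <- '<html><body>
<table>
  <tr><th>Name</th><th>FIPS State Numeric Code</th><th>Official USPS Code</th></tr>
  <tr><td>Alabama</td><td>01</td><td>AL</td></tr>
  <tr><td>Alaska</td><td>02</td><td>AK</td></tr>
</table>
<table>
  <tr><th>Area Name</th><th>FIPS State Numeric Code</th><th>Official USPS Code</th></tr>
  <tr><td>Guam</td><td>66</td><td>GU</td></tr>
</table>
</body></html>'

page <- read_html(html_text)

# Parse every <table>; convert = FALSE keeps all cells as character,
# so codes such as "01" are not turned into the integer 1.
tables <- html_table(html_elements(page, "table"), convert = FALSE)

# Keep the first table that has a column called exactly "Name".
has_name_col <- vapply(tables, function(tb) "Name" %in% names(tb), logical(1))
fips_states <- tables[[which(has_name_col)[1]]]

print(fips_states)
```

Matching on a header that only the wanted table has avoids depending on table order, which may change if the page is edited. For the real page, replace html_text with the URL, or with the text returned by readLines() collapsed into a single string.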

Upvotes: 1
