Mona Jalal

Reputation: 38155

Web Scraping in R--readHTMLTable has table names as NULL

Here's my code for reading the tables, but the tables that come back have NULL names. Is there a better method for finding the land area of each state in square miles, without the commas in the numbers? My idea was to extract the second table and convert it to a data.frame, but now that the names are NULL I'm not sure what to do, or whether there's a better approach.

require("XML")
url <- "http://simple.wikipedia.org/wiki/List_of_U.S._states_by_area"
wiki_page <- readLines(url)
length(wiki_page)
tables <- readHTMLTable(url)

Here's a sample output:

> tables
$`NULL`
   Rank          State       km²     miles²
1     1         Alaska 1,717,854    663,267
2     2          Texas   696,621    268,581
3     3     California   423,970    163,696
4     4        Montana   380,838    147,042
5     5     New Mexico   314,915    121,589
6     6        Arizona   295,254    113,998
7     7         Nevada   286,351    110,561
8     8       Colorado   269,601    104,094
9     9         Oregon   254,805     98,381
....

Upvotes: 2

Views: 858

Answers (1)

agstudy

Reputation: 121568

You should read the names and assign them to tables:

library(XML)
url <- "http://simple.wikipedia.org/wiki/List_of_U.S._states_by_area"
doc <- htmlParse(url)
## the section headings supply the table names; drop the 4th heading,
## which has no table attached to it
nn <- xpathSApply(doc, '//*[@class="mw-headline"]', xmlValue)[-4]
tabs <- readHTMLTable(url)
names(tabs) <- nn

Check the result:

str(tabs, max.level = 1)
# List of 3
# $ Total area:'data.frame':  50 obs. of  4 variables:
#   $ Land area :'data.frame':  50 obs. of  4 variables:
#   $ Water area:'data.frame':  50 obs. of  5 variables:

Numeric conversion:

## strip thousands separators, then coerce to numeric
convert_num <- function(x) as.numeric(gsub(',', '', x))

## apply to every column except the first two (Rank and State)
lapply(tabs, function(y) {
  y[, -c(1, 2)] <- sapply(y[, -c(1, 2)], convert_num)
  y
})
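To see the comma-stripping step in isolation, here is a minimal sketch using a made-up vector of area strings (the values are copied from the sample output in the question):

```r
## convert_num as defined above: drop commas, then coerce to numeric
convert_num <- function(x) as.numeric(gsub(',', '', x))

## hypothetical sample of the miles² column as character strings
areas <- c("1,717,854", "696,621", "423,970")
convert_num(areas)
# returns 1717854 696621 423970
```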

Upvotes: 1
