Reputation: 43
I'm attempting to web-scrape tax-rates.org to get the average tax percentage for each county in Texas. I have a list of 255 counties in a .csv file which I import as "TX_counties"; it's a single-column table. I have to build the URL for each county as a string, so I set d1 to the first cell using [i,1], concatenate it into a URL string, perform the scrape, then add +1 to [i] so it moves to the second cell for the next county name, and so on.
The problem is I can't figure out how to store the scrape results in a "growing list" which I then want to turn into a table and save to a .csv file at the end. I'm only able to scrape one county at a time, and each result overwrites the previous one.
Any thoughts? (fairly new to R and scraping in general)
library(XML)
library(data.table)

i <- 1
for (i in 1:255) {
  d1 <- as.character(TX_counties[i,1])
  uri.seed <- paste(c('http://www.tax-rates.org/texas/', d1, '_county_property_tax'), collapse='')
  html <- htmlTreeParse(file = uri.seed, isURL=TRUE, useInternalNodes = TRUE)
  avg_taxrate <- sapply(getNodeSet(html, "//div[@class='box']/div/div[1]/i[1]"), xmlValue)
  t1 <- data.table(d1, avg_taxrate)
  i <- i+1
}
write.csv(t1,"2015_TX_PropertyTaxes.csv")
Upvotes: 4
Views: 927
Reputation: 86
You should first initialise a list to store the data scraped in each loop iteration; make sure to initialise it before you enter the loop.
Then, with each iteration, append to the list before starting the next iteration. See my answer here:
Web Scraping in R with loop from data.frame
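Applied to the loop in the question, that pattern looks roughly like this (a minimal sketch, assuming the XML and data.table packages and the same XPath as in the question):
library(XML)
library(data.table)

results <- vector("list", 255)          # initialise an empty list before the loop

for (i in 1:255) {
  d1 <- as.character(TX_counties[i, 1])
  uri.seed <- paste0('http://www.tax-rates.org/texas/', d1, '_county_property_tax')
  html <- htmlTreeParse(file = uri.seed, isURL = TRUE, useInternalNodes = TRUE)
  avg_taxrate <- sapply(getNodeSet(html, "//div[@class='box']/div/div[1]/i[1]"), xmlValue)
  results[[i]] <- data.table(county = d1, avg_taxrate = avg_taxrate)   # store this iteration's row
}

t1 <- rbindlist(results)                # combine all per-county rows into one table at the end
write.csv(t1, "2015_TX_PropertyTaxes.csv", row.names = FALSE)
rbindlist() stacks the one-row tables into a single data.table, so the file is written once at the end instead of t1 being overwritten on every pass.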
Upvotes: 0
Reputation: 78792
This uses rvest, provides a progress bar, and takes advantage of the fact that the URLs are already there for you on the page:
library(rvest)
library(pbapply)
pg <- read_html("http://www.tax-rates.org/texas/property-tax")
# get all the county tax table links
ctys <- html_nodes(pg, "table.propertyTaxTable > tr > td > a[href*='county_property']")
# match your lowercased names
county_name <- tolower(gsub(" County", "", html_text(ctys)))
# spider each page and return the rate %
county_rate <- pbsapply(html_attr(ctys, "href"), function(URL) {
  cty_pg <- read_html(URL)
  html_text(html_nodes(cty_pg, xpath="//div[@class='box']/div/div[1]/i[1]"))
}, USE.NAMES=FALSE)
tax_table <- data.frame(county_name, county_rate, stringsAsFactors=FALSE)
tax_table
## county_name county_rate
## 1 anderson Avg. 1.24% of home value
## 2 andrews Avg. 0.88% of home value
## 3 angelina Avg. 1.35% of home value
## 4 aransas Avg. 1.29% of home value
write.csv(tax_table, "2015_TX_PropertyTaxes.csv")
NOTE 1: I limited scraping to 4 counties so as not to kill the bandwidth of a site that offers free data.
NOTE 2: There are only 254 county tax links available on that site, so you seem to have an extra one if you have 255.
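If you also want to cap the run while testing, one hypothetical way is to subset the links right after the html_nodes() call, before county_name and county_rate are built:
ctys <- ctys[1:4]   # keep only the first four county links while testing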
Upvotes: 3
Reputation: 3710
library(RCurl)
library(XML)
tx_c <- c("anderson", "andrews")
res <- sapply(1:2, function(x){
  d1 <- as.character(tx_c[x])
  # build the county-specific URL
  uri.seed <- paste(c('http://www.tax-rates.org/texas/', d1, '_county_property_tax'), collapse='')
  html <- htmlTreeParse(file = uri.seed, isURL=TRUE, useInternalNodes = TRUE)
  # pull the average tax rate text from the page
  avg_taxrate <- sapply(getNodeSet(html, "//div[@class='box']/div/div[1]/i[1]"), xmlValue)
  return(c(d1, avg_taxrate))
})
res.df <- data.frame(t(res), stringsAsFactors = FALSE)
names(res.df) <- c("county", "property")
res.df
# county property
# 1 anderson Avg. 1.24% of home value
# 2 andrews Avg. 0.88% of home value
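Here sapply() returns a two-row character matrix (one column per county), so t() transposes it before building the data frame, giving one row per county. To cover the full list from the question rather than the two counties hard-coded above, you could, for example, set tx_c <- as.character(TX_counties[[1]]) and loop over seq_along(tx_c) instead of 1:2, then write res.df out with write.csv() as in the question.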
Upvotes: 2