Reputation: 3
I'm trying to pull HTML data tables into a single data frame, and I'm looking for an elegant solution. There are 255 tables, and the URLs vary by two variables: Year and Aldermanic District. I know there must be a way to use for loops or something similar, but I'm stumped.
I have successfully imported the data by reading each table in with a separate line of code, but this results in a line for each table, and again, there are 255 tables.
library(XML)
library(dplyr)  # bind_rows() comes from dplyr
# Shown for four of the tables; the remaining calls follow the same pattern
data <- bind_rows(
  readHTMLTable("http://assessments.milwaukee.gov/SalesData/2018_RVS_Dist14.htm", skip.rows = 1),
  readHTMLTable("http://assessments.milwaukee.gov/SalesData/2017_RVS_Dist14.htm", skip.rows = 1),
  readHTMLTable("http://assessments.milwaukee.gov/SalesData/2016_RVS_Dist14.htm", skip.rows = 1),
  readHTMLTable("http://assessments.milwaukee.gov/SalesData/2015_RVS_Dist14.htm", skip.rows = 1)
)
Ideally, I could use a for loop or something similar so I wouldn't have to hand-code the readHTMLTable function for each table.
Upvotes: 0
Reputation: 39154
We can use map_dfr from the purrr package (part of the tidyverse) to apply the readHTMLTable function to each URL. The key is to identify the part that differs between URLs. In this case the year, 2015:2018, is the only thing that changes, so we can construct each URL with paste0. readHTMLTable returns a list of tables, so [[1]] selects the first table on each page. map_dfr then automatically combines all the data frames into one, so dat is the final output.
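To see the row-binding behaviour without touching the network, here is a minimal sketch that swaps the download for a stand-in. The helper fake_table and the result name demo are hypothetical and introduced only for illustration; the real call uses readHTMLTable instead.

```r
library(purrr)

# Hypothetical stand-in for the download step: returns a one-row data frame
# per year, so nothing is fetched from the site.
fake_table <- function(year) {
  data.frame(
    Year = year,
    url  = paste0("http://assessments.milwaukee.gov/SalesData/",
                  year, "_RVS_Dist14.htm"),
    stringsAsFactors = FALSE
  )
}

# map_dfr calls fake_table once per year and row-binds the results
demo <- map_dfr(2015:2018, fake_table)
print(demo)
```

Each year contributes one row here; with readHTMLTable in place of the stand-in, each year would contribute all the rows of that page's table.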
library(tidyverse)
library(XML)
dat <- map_dfr(2015:2018,
~readHTMLTable(paste0("http://assessments.milwaukee.gov/SalesData/",
.x,
"_RVS_Dist14.htm"), skip.rows = 1)[[1]])
Update
Here is a way to expand every combination of year and district number, and then download the data with map2_dfr.
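To make the expansion concrete, the sketch below only builds the grid and the URLs and downloads nothing. It assumes the same ranges used in this answer (years 2002 to 2018, districts 1 to 15), which gives 17 x 15 = 255 combinations, matching the number of tables in the question. The names grid and urls are illustrative.

```r
# grid and urls are illustrative names; no download happens here
grid <- expand.grid(Year = 2002:2018, Number = 1:15)

# One row per Year/Number pair
print(nrow(grid))

# paste0 is vectorised, so each row of the grid yields one URL
urls <- paste0("http://assessments.milwaukee.gov/SalesData/",
               grid$Year, "_RVS_Dist", grid$Number, ".htm")
print(head(urls, 3))
```

Inspecting a few URLs this way is a cheap check before starting 255 requests.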
url <- expand.grid(Year = 2002:2018, Number = 1:15)
dat <- map2_dfr(url$Year, url$Number,
~readHTMLTable(paste0("http://assessments.milwaukee.gov/SalesData/",
.x,
"_RVS_Dist",
.y,
".htm"), skip.rows = 1)[[1]])
Upvotes: 1
Reputation: 521073
You could try creating a vector containing all the URLs you want to scrape, and then iterating over it with a for loop:
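library(XML)
url1 <- "http://assessments.milwaukee.gov/SalesData/"
url2 <- "_RVS_Dist"
years <- 2015:2018
dist <- 1:15
# Build every year/district combination; each page name ends in ".htm"
urls <- apply(expand.grid(paste0(url1, years), paste0(url2, dist, ".htm")), 1, paste, collapse = "")
data <- NULL
for (url in urls) {
  # readHTMLTable returns a list of tables; keep the first one
  df <- readHTMLTable(url, skip.rows = 1)[[1]]
  data <- rbind(data, df)
}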
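Growing a data frame with rbind inside the loop copies it on every iteration. A possible alternative, sketched here in base R, collects the tables in a list and binds them once at the end. The helpers build_urls and read_all and the reader argument are hypothetical names introduced for illustration, and the sketch assumes every table has the same columns, since rbind requires that.

```r
# Hypothetical helper: one URL per year/district pair
build_urls <- function(years, districts) {
  grid <- expand.grid(year = years, dist = districts)
  paste0("http://assessments.milwaukee.gov/SalesData/",
         grid$year, "_RVS_Dist", grid$dist, ".htm")
}

# Hypothetical helper: apply `reader` to each URL, then bind the results once.
# rbind assumes every table has the same column names.
read_all <- function(urls, reader) {
  tables <- lapply(urls, reader)
  do.call(rbind, tables)
}

urls <- build_urls(2015:2018, 1:15)

# Example call (performs one download per URL), assuming the XML package:
# data <- read_all(urls, function(u) XML::readHTMLTable(u, skip.rows = 1)[[1]])
```

Passing the reader as an argument keeps the URL logic testable without network access.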
Upvotes: 1