Pecners

Reputation: 3

Reading numerous html tables into R

I'm trying to pull HTML data tables into a single data frame, and I'm looking for an elegant solution. There are 255 tables, and the URLs vary by two variables: year and aldermanic district. I know there must be a way to use a for loop or something similar, but I'm stumped.

I have successfully imported the data by reading each table in with a separate line of code, but that means one line per table, and again, there are 255 tables.

library(XML)
library(dplyr)  # for bind_rows

data <- bind_rows(readHTMLTable("http://assessments.milwaukee.gov/SalesData/2018_RVS_Dist14.htm", skip.rows=1),
                  readHTMLTable("http://assessments.milwaukee.gov/SalesData/2017_RVS_Dist14.htm", skip.rows=1),
                  readHTMLTable("http://assessments.milwaukee.gov/SalesData/2016_RVS_Dist14.htm", skip.rows=1),
                  readHTMLTable("http://assessments.milwaukee.gov/SalesData/2015_RVS_Dist14.htm", skip.rows=1),
                  # ... and so on, one readHTMLTable call per table
                  )

Ideally, I could use a for loop or something similar so I wouldn't have to hand-code a readHTMLTable call for each table.

Upvotes: 0

Views: 46

Answers (2)

www

Reputation: 39154

We can use map_dfr from the purrr package (part of the tidyverse) to apply the readHTMLTable function across the URLs. The key is to identify the part that differs between URLs. In this case the year 2015:2018 is the only thing that changes, so we can construct each URL with paste0. map_dfr automatically combines all the data frames into one combined data frame. dat is the final output.

library(tidyverse)
library(XML)

dat <- map_dfr(2015:2018,
               ~readHTMLTable(paste0("http://assessments.milwaukee.gov/SalesData/",
                                     .x,
                                     "_RVS_Dist14.htm"), skip.rows = 1)[[1]])

Update

Here is a way to expand all combinations of year and district number with expand.grid, and then download the data with map2_dfr.

url <- expand.grid(Year = 2002:2018, Number = 1:15)

dat <- map2_dfr(url$Year, url$Number,
                ~readHTMLTable(paste0("http://assessments.milwaukee.gov/SalesData/",
                                      .x,
                                      "_RVS_Dist",
                                      .y,
                                      ".htm"), skip.rows = 1)[[1]])
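One caveat with the full grid: it has 17 × 15 = 255 URLs, and if any year/district combination has no page, readHTMLTable will throw an error and abort the whole map. A sketch using purrr::possibly to skip failures instead (the helper name read_table_safely is made up here):

```r
library(tidyverse)
library(XML)

# possibly() returns NULL for any URL that errors; map2_dfr / bind_rows
# silently drop NULL elements, so missing pages are simply skipped.
read_table_safely <- possibly(
  function(year, dist) {
    readHTMLTable(paste0("http://assessments.milwaukee.gov/SalesData/",
                         year, "_RVS_Dist", dist, ".htm"),
                  skip.rows = 1)[[1]]
  },
  otherwise = NULL
)

url <- expand.grid(Year = 2002:2018, Number = 1:15)  # 255 combinations
dat <- map2_dfr(url$Year, url$Number, read_table_safely)
```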

Upvotes: 1

Tim Biegeleisen

Reputation: 521073

You could try creating a vector containing all the URLs which you want to scrape, and then iterate over those inputs using a for loop:

library(XML)

url1 <- "http://assessments.milwaukee.gov/SalesData/"
url2 <- "_RVS_Dist"
years <- 2015:2018
dist <- 1:15
urls <- apply(expand.grid(paste0(url1, years), paste0(url2, dist, ".htm")), 1, paste, collapse="")
data <- NULL
for (url in urls) {
    df <- readHTMLTable(url, skip.rows = 1)[[1]]  # first (and only) table on the page
    data <- rbind(data, df)
}
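As a side note, growing data with rbind inside the loop re-copies the accumulated data frame on every iteration. A sketch of an equivalent approach, assuming the same URL scheme, that reads everything into a list and combines once at the end:

```r
library(XML)

url1 <- "http://assessments.milwaukee.gov/SalesData/"
url2 <- "_RVS_Dist"
urls <- apply(expand.grid(paste0(url1, 2015:2018), paste0(url2, 1:15, ".htm")),
              1, paste, collapse = "")  # 4 years x 15 districts = 60 URLs

# One readHTMLTable call per URL, collected in a list, then bound in one pass.
tables <- lapply(urls, function(u) readHTMLTable(u, skip.rows = 1)[[1]])
data <- do.call(rbind, tables)
```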

Upvotes: 1
