Reputation: 181
I have a question about scraping data from web pages into a dataframe...I have a script that works somewhat but could be improved and I'm struggling to implement the improvements I want.
With a lot of help from googling and reading through StackOverflow materials, I've made a script that grabs NPDES SIC code data from the EPA website using a list of permit numbers.
library(rvest)
library(tidyverse)
# Start with permit number list
permitlist <- list("DC0000175", "VA0076384", "VA0021318")
# For each permit number, create a URL list that leads to the EPA site associated with that permit
urls <- list()
for(i in 1:length(permitlist)) {
url <- paste0("https://iaspub.epa.gov/enviro/fii_query_dtl.disp_program_facility?pgm_sys_id_in=", permitlist[i], "&pgm_sys_acrnm_in=NPDES")
urls[i] <- url
}
# Open each URL and scrape SIC code data
complete<-data.frame()
tbl <- list()
for (j in seq_along(urls)) {
data <- tbl[[j]] <- urls[[j]] %>%
read_html() %>%
html_nodes("tr") %>%
html_text()
data <- unlist(strsplit(trimws(grep("^NPDES\n", tbl[[j]], value=TRUE)[1]), "\n", perl=TRUE))
complete <-rbind(complete,data, stringsAsFactors = FALSE)
}
The output table looks like this:
1 NPDES 3273 READY-MIXED CONCRETE
2 NPDES 3714 MOTOR VEHICLE PARTS AND ACCESSORIES
3 NPDES 4952 SEWERAGE SYSTEMS
# (Row number, type of permit (all will be NPDES), NPDES SIC code, and description.)
I would like to edit the script so that the dataframe includes a column for the permit number. Also, some permit numbers have more than one NPDES SIC code (ex: DC0000175 at https://iaspub.epa.gov/enviro/fii_query_dtl.disp_program_facility?pgm_sys_id_in=DC0000175&pgm_sys_acrnm_in=NPDES has two NPDES SIC codes), but I've only been able to get the script to grab the first SIC code.
Ideally, my final dataframe would look something like the following:
1 NPDES 3273 DC0000175 READY-MIXED CONCRETE
2 NPDES 1611 DC0000175 HIGHWAY AND STREET CONSTRUCTION, EXCEPT ELEVATED HIGHWAYS
3 NPDES 3714 VA0076384 MOTOR VEHICLE PARTS AND ACCESSORIES
4 NPDES 4952 VA0021318 SEWERAGE SYSTEMS
I've been struggling for a while now to get this to work and would appreciate any help.
Upvotes: 0
Views: 170
Reputation: 7646
Make two changes:
c()
- this makes it run a whole lot faster too!data
df and assign the corresponding permit valuelibrary(rvest)
library(tidyverse)
# Start with permit number list
# vector is more efficient and more easily iterable
permitlist <- c("DC0000175", "VA0076384", "VA0021318")
# For each permit number, create a URL list that leads to the EPA site associated with that permit
urls <- list()
for(i in 1:length(permitlist)) {
url <- paste0("https://iaspub.epa.gov/enviro/fii_query_dtl.disp_program_facility?pgm_sys_id_in=", permitlist[i], "&pgm_sys_acrnm_in=NPDES")
urls[i] <- url
}
# Open each URL and scrape SIC code data
complete<-data.frame()
tbl <- list()
for (j in seq_along(urls)) {
tbl[[j]] <- urls[[j]] %>%
read_html() %>%
html_nodes("tr") %>%
html_text()
data <- data.frame(A = NA, B = NA, c = NA, permit = NA)
a <- grep("^NPDES\n", tbl[[j]], value=TRUE)
for(k in seq_along(a)){
dt <- unlist(strsplit(trimws(a[k]), "\n", perl=TRUE))
dt <- c(dt, permitlist[j])
data[k,] <- dt
}
complete <- bind_rows(complete, data)
}
complete
Edited to make output dataframe a bit tidier, by assigning column names (A, B, C, permit) before filling table.
Edited 2: now gets more than one item from each url call
Upvotes: 2