happymappy
happymappy

Reputation: 181

Scraping data from web into dataframe with rvest

I have a question about scraping data from web pages into a dataframe...I have a script that works somewhat but could be improved and I'm struggling to implement the improvements I want.

With a lot of help from googling and reading through StackOverflow materials, I've made a script that grabs NPDES SIC code data from the EPA website using a list of permit numbers.

library(rvest)
library(tidyverse)

# Start with permit number list 

permitlist <- list("DC0000175", "VA0076384", "VA0021318")

# For each permit number, create a URL list that leads to the EPA site associated with that permit

urls <- list()
for(i in 1:length(permitlist)) {
  url <- paste0("https://iaspub.epa.gov/enviro/fii_query_dtl.disp_program_facility?pgm_sys_id_in=", permitlist[i], "&pgm_sys_acrnm_in=NPDES")
urls[i] <- url
} 

# Open each URL and scrape SIC code data

complete<-data.frame()
tbl <- list()

for (j in seq_along(urls)) {
  data <- tbl[[j]] <- urls[[j]] %>%  
    read_html() %>% 
    html_nodes("tr") %>%
    html_text()
  data <- unlist(strsplit(trimws(grep("^NPDES\n", tbl[[j]], value=TRUE)[1]), "\n", perl=TRUE))
  complete <-rbind(complete,data, stringsAsFactors = FALSE)
  }

The output table looks like this:


1    NPDES    3273                READY-MIXED CONCRETE
2    NPDES    3714                MOTOR VEHICLE PARTS AND ACCESSORIES
3    NPDES    4952                SEWERAGE SYSTEMS

# (Row number, type of permit (all will be NPDES), NPDES SIC code, and description.)

I would like to edit the script so that the dataframe includes a column for the permit number. Also, some permit numbers have more than one NPDES SIC code (ex: DC0000175 at https://iaspub.epa.gov/enviro/fii_query_dtl.disp_program_facility?pgm_sys_id_in=DC0000175&pgm_sys_acrnm_in=NPDES has two NPDES SIC codes), but I've only been able to get the script to grab the first SIC code.

Ideally, my final dataframe would look something like the following:


1    NPDES    3273    DC0000175  READY-MIXED CONCRETE
2    NPDES    1611    DC0000175  HIGHWAY AND STREET CONSTRUCTION, EXCEPT ELEVATED HIGHWAYS
3    NPDES    3714    VA0076384  MOTOR VEHICLE PARTS AND ACCESSORIES
4    NPDES    4952    VA0021318  SEWERAGE SYSTEMS

I've been struggling for a while now to get this to work and would appreciate any help.

Upvotes: 0

Views: 170

Answers (1)

Andy Baxter
Andy Baxter

Reputation: 7646

Make two changes:

  • change permitlist from a list to a vector using c() - this makes it run a whole lot faster too!
  • add a new column to your data df and assign the corresponding permit value
library(rvest)
library(tidyverse)

# Start with permit number list 

# vector is more efficient and more easily iterable
permitlist <- c("DC0000175", "VA0076384", "VA0021318")

# For each permit number, create a URL list that leads to the EPA site associated with that permit

urls <- list()
for(i in 1:length(permitlist)) {
  url <- paste0("https://iaspub.epa.gov/enviro/fii_query_dtl.disp_program_facility?pgm_sys_id_in=", permitlist[i], "&pgm_sys_acrnm_in=NPDES")
  urls[i] <- url
} 

# Open each URL and scrape SIC code data

complete<-data.frame()
tbl <- list()

for (j in seq_along(urls)) {
  tbl[[j]] <- urls[[j]] %>%  
    read_html() %>% 
    html_nodes("tr") %>%
    html_text()

  data <- data.frame(A = NA, B = NA, c = NA, permit = NA)
  a <- grep("^NPDES\n", tbl[[j]], value=TRUE)

  for(k in seq_along(a)){
  dt <- unlist(strsplit(trimws(a[k]), "\n", perl=TRUE))
  dt <- c(dt, permitlist[j])
  data[k,] <- dt
  }
  complete <- bind_rows(complete, data)

}

complete

Edited to make output dataframe a bit tidier, by assigning column names (A, B, C, permit) before filling table.

Edited 2: now gets more than one item from each url call

Upvotes: 2

Related Questions