Reputation: 181
I want to use R to extract text and numbers from the following page: https://iaspub.epa.gov/enviro/fii_query_dtl.disp_program_facility?pgm_sys_id_in=PA0261696&pgm_sys_acrnm_in=NPDES
Specifically, I want the NPDES SIC code and the description, which is 6515 and "Operators of residential mobile home sites" here.
library(rvest)
test <- read_html("https://iaspub.epa.gov/enviro/fii_query_dtl.disp_program_facility?pgm_sys_id_in=MDG766216&pgm_sys_acrnm_in=NPDES")
test <- test %>% html_nodes("tr") %>% html_text()
# This extracts 31 lines of text; here is what my target text looks like:
# [10] "NPDES\n6515\nOPERATORS OF RESIDENTIAL MOBILE HOME SITES\n\n"
Ideally, I'd like the following: "6515 OPERATORS OF RESIDENTIAL MOBILE HOME SITES"
How would I do this? I'm trying and failing at regex here even just trying to extract the number 6515 alone, which I thought would be easiest...
as.numeric(sub(".*?NPDES.*?(\\d{4}).*", "\\1", test))
# 4424
Any advice?
Upvotes: 1
Views: 81
Reputation: 1720
From what I can see, your information resides in a table. It might be a better idea to perhaps just extract the information as a dataframe itself. This works:
library(rvest)
test <- read_html("https://iaspub.epa.gov/enviro/fii_query_dtl.disp_program_facility?pgm_sys_id_in=MDG766216&pgm_sys_acrnm_in=NPDES")
tables <- html_nodes(test, "table")
tables
SIC <- as.data.frame(html_table(tables[5], fill = TRUE))
Upvotes: 2