happymappy

Reputation: 181

Extract text and numbers from web page using regex in R

I want to use R to extract text and numbers from the following page: https://iaspub.epa.gov/enviro/fii_query_dtl.disp_program_facility?pgm_sys_id_in=PA0261696&pgm_sys_acrnm_in=NPDES

Specifically, I want the NPDES SIC code and the description, which is 6515 and "Operators of residential mobile home sites" here.

library(rvest)

test <- read_html("https://iaspub.epa.gov/enviro/fii_query_dtl.disp_program_facility?pgm_sys_id_in=MDG766216&pgm_sys_acrnm_in=NPDES")

test <- test %>% html_nodes("tr") %>% html_text()

# This extracts 31 lines of text; here is what my target text looks like:

#  [10] "NPDES\n6515\nOPERATORS OF RESIDENTIAL MOBILE HOME SITES\n\n" 

Ideally, I'd like the following: "6515 OPERATORS OF RESIDENTIAL MOBILE HOME SITES"

How would I do this? My regex attempts fail even when I try to extract just the number 6515, which I thought would be the easiest part:

as.numeric(sub(".*?NPDES.*?(\\d{4}).*", "\\1", test))

# 4424

Any advice?

Upvotes: 1

Views: 81

Answers (1)

Dhiraj

Reputation: 1720

From what I can see, your information sits in an HTML table, so it is probably easier to extract that table directly as a data frame instead of parsing the row text. This works:

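library(rvest)

# Download and parse the page.
test <- read_html("https://iaspub.epa.gov/enviro/fii_query_dtl.disp_program_facility?pgm_sys_id_in=MDG766216&pgm_sys_acrnm_in=NPDES")

# Collect every <table> node; printing the node set helps locate the right one.
tables <- html_nodes(test, "table")
tables

# The fifth table holds the SIC information; convert it to a data frame.
SIC <- as.data.frame(html_table(tables[5], fill = TRUE))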
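
If you still want the single string from the question, here is a minimal regex sketch. It assumes the row text looks exactly like element [10] quoted above; the vector `rows` and its other two elements are made-up stand-ins for the output of html_text(), so nothing here touches the network. Because sub() is applied to every element of a vector, the sketch first keeps only the rows that match the whole pattern and then pulls out the capture groups.

```r
# Stand-in for the vector returned by html_text(). The second element is the
# row quoted in the question; the other two are hypothetical filler rows.
rows <- c(
  "Facility Name\nEXAMPLE\n",
  "NPDES\n6515\nOPERATORS OF RESIDENTIAL MOBILE HOME SITES\n\n",
  "Other row\n"
)

# "NPDES", whitespace, a 4-digit code, whitespace, then the description,
# with any trailing whitespace left out of the second capture group.
pattern <- "^NPDES\\s+(\\d{4})\\s+(.*?)\\s*$"

# Keep only the matching rows so sub() never sees unrelated rows.
hit <- grep(pattern, rows, perl = TRUE, value = TRUE)

# Group 1 is the code, group 2 is the description.
sic_code  <- as.numeric(sub(pattern, "\\1", hit, perl = TRUE))
sic_label <- sub(pattern, "\\1 \\2", hit, perl = TRUE)

print(sic_code)
print(sic_label)
```

On the real page you would pass your own `test` vector in place of `rows`. If the row layout differs from the quoted example, for instance a code that is not four digits long, the pattern would need adjusting.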

Upvotes: 2
