Extract text and numbers from web page using regex in R

Question

I want to use R to extract text and numbers from the following page: https://iaspub.epa.gov/enviro/fii_query_dtl.disp_program_facility?pgm_sys_id_in=PA0261696&pgm_sys_acrnm_in=NPDES

Specifically, I want the NPDES SIC code and the description, which is 6515 and "Operators of residential mobile home sites" here.

library(rvest)

test <- read_html("https://iaspub.epa.gov/enviro/fii_query_dtl.disp_program_facility?pgm_sys_id_in=MDG766216&pgm_sys_acrnm_in=NPDES")

test <- test %>% html_nodes("tr") %>% html_text()

# This extracts 31 lines of text; here is what my target text looks like:

#  [10] "NPDES
6515
OPERATORS OF RESIDENTIAL MOBILE HOME SITES

"

Ideally, I'd like the following: "6515 OPERATORS OF RESIDENTIAL MOBILE HOME SITES"

How would I do this? I'm trying and failing at regex here even just trying to extract the number 6515 alone, which I thought would be easiest...

as.numeric(sub(".*?NPDES.*?(\\d{4}).*", "\\1", test))

# 4424

Any advice?

Dhiraj · Accepted Answer

From what I can see, the information you want sits in an HTML table, so it is probably easier to pull the whole table into a data frame than to parse the row text with regex. This works:

library(rvest)

test <- read_html("https://iaspub.epa.gov/enviro/fii_query_dtl.disp_program_facility?pgm_sys_id_in=MDG766216&pgm_sys_acrnm_in=NPDES")

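# Collect every <table> node on the page and print them to inspect
tables <- html_nodes(test, "table")
tables

# The fifth table is the one with the SIC codes on this page; convert it to a data frame
SIC <- as.data.frame(html_table(tables[5], fill = TRUE))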
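
To get the single string you asked for, here is a minimal sketch of two options. The first pastes the code and description columns from the data frame; since I can't confirm the real column names or order in `SIC`, it uses a hypothetical stand-in called `SIC_demo` and selects columns by position. The second is a regex on the row text exactly as you posted it, in case you prefer to stay with `html_text()`. The names `SIC_demo`, `row_text`, `sic_table` and `sic_regex` are made up for illustration.

```r
# Option 1: build the string from the data frame.
# SIC_demo is a hypothetical stand-in for SIC; the real column names and
# order on the EPA page may differ, so check names(SIC) first.
SIC_demo <- data.frame(
  source      = "NPDES",
  code        = 6515,
  description = "OPERATORS OF RESIDENTIAL MOBILE HOME SITES",
  stringsAsFactors = FALSE
)

# Keep the NPDES row, then paste the code and description together
npdes_row <- SIC_demo[SIC_demo[[1]] == "NPDES", , drop = FALSE]
sic_table <- paste(npdes_row[[2]], npdes_row[[3]])

# Option 2: regex on the raw row text from html_text(), as shown in the question
row_text <- "NPDES\n6515\nOPERATORS OF RESIDENTIAL MOBILE HOME SITES\n\n"

# Capture the 4-digit code and the rest of the following line.
# Note the doubled backslashes: R string literals need "\\d", not "\d".
pattern <- "NPDES\\s+(\\d{4})\\s+([^\\r\\n]+)"
m <- regmatches(row_text, regexec(pattern, row_text, perl = TRUE))[[1]]

# m[1] is the full match, m[2] the code, m[3] the description
sic_regex <- paste(m[2], trimws(m[3]))

print(sic_table)
print(sic_regex)
```

With the real page you would apply the regex to the one element of `test` that contains the NPDES row rather than to the whole vector, since `sub()` and `regexec()` work on every element.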
