scrape a certain portion of HTML text in R

Question

I am attempting to scrape a National Weather Service webpage and take only a certain portion of the text out and turn it into a character object in R. It would end up being a small paragraph as shown on the NWS page. (see below)

I have been scraping the webpage with the rvest package and have tried some code with the XML package as well.

Here is my code which has the Weather Service URL included.

library(rvest)

weather_con <- read_html("http://forecast.weather.gov/product.php?site=TWC&issuedby=TWC&product=AFD&format=txt&version=1&glossary=1")

weather_con <- weather_con %>%
  html_nodes("#localcontent") %>%
  html_text()

I've also tried the RCurl and XML packages with this code:

library(RCurl)  # provides getURL()
library(XML)    # provides htmlParse()

weather_con <- getURL("http://forecast.weather.gov/product.php?site=TWC&issuedby=TWC&product=AFD&format=txt&version=1&glossary=1")

weather_con <- htmlParse(weather_con, asText = TRUE)

Both of these sets of code read in all the text from the page. I've tried other options and have attempted to find the nodes of the page to scrape certain portions of the text, but I haven't found anything useful. I have little experience with HTML so I might be missing something easy here.

All I am looking to pull out of the webpage is the SYNOPSIS paragraph. It is a small paragraph near the top of the page, and it conveniently ends with && on the line below the end of the paragraph.

Perhaps I need something like the substr function so that I can extract that paragraph directly. However, I was hoping to find something in rvest and/or XML to do the job.

Any suggestions?

Thank you

R. Schifini · Accepted Answer

weather_con already contains the text you need, but it comes along with the rest of the page text.

One way to extract it is to use regular expressions.

synopsis = regmatches(x = weather_con, 
                      m = regexpr(pattern = "SYNOPSIS[^&]*",
                                  text = weather_con))

This captures everything from SYNOPSIS up to, but not including, the first & character.

Result:

 [1] "SYNOPSIS...Strong high pressure aloft will
 maintain well above
average temperatures today. Thursday
 and Friday will see us between
low pressure developing
 north of the area and high pressure shifting
southward.
 As a result, expect gusty winds and several degrees
 of
cooling. Strengthening high pressure this weekend
 will again push
temperatures above average.

"

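The match above keeps the line breaks and trailing blank lines from the page. As a minimal sketch of how they could be tidied up in base R, the example below uses a short made-up string (sample_text, a hypothetical stand-in for weather_con, not real NWS output), applies the same pattern, and then collapses whitespace:

```r
# Hypothetical stand-in for the scraped page text (not real NWS output).
sample_text <- "HEADER\n\n.SYNOPSIS...Warm today.\nCooler Friday.\n\n&&\n\n.DISCUSSION...Details."

# Same pattern as above: "SYNOPSIS" plus every following character that is not "&".
raw <- regmatches(sample_text, regexpr("SYNOPSIS[^&]*", sample_text))

# Collapse each run of whitespace (including newlines) into one space,
# then trim leading and trailing whitespace.
clean <- trimws(gsub("[[:space:]]+", " ", raw))
clean
```

Applying the same gsub() and trimws() steps to synopsis should give a single-line paragraph, assuming the page text has the same shape as this sample.
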
If the synopsis itself contains an &, you could instead capture the text up to the word DISCUSSION.

synopsis2 = regmatches(x = weather_con, 
                       m = regexpr(pattern = "SYNOPSIS.*DISCUSSION",
                                   text = weather_con))

The result is similar, except that the match now ends with "above average. && .DISCUSSION", so the closing && and the .DISCUSSION heading are included.
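One caveat with "SYNOPSIS.*DISCUSSION" is that .* is greedy, so if the word DISCUSSION appears again further down the page, the match would run to its last occurrence. A possible alternative, sketched here on a made-up string (page_text and synopsis3 are hypothetical names, and the text is not real NWS output), is a lazy match with a lookahead under perl = TRUE, which stops just before the first && and still tolerates a single & inside the paragraph:

```r
# Hypothetical stand-in for the scraped page text (not real NWS output).
page_text <- paste0(
  "HEADER\n\n.SYNOPSIS...Wind & rain today.\n\n&&\n\n",
  ".DISCUSSION...First DISCUSSION text.\n\n&&\n\n.AVIATION...See DISCUSSION above."
)

# (?s) lets "." match newlines under PCRE; ".*?" is lazy; (?=&&) stops
# just before the first doubled ampersand without including it.
m <- regexpr("(?s)SYNOPSIS.*?(?=&&)", page_text, perl = TRUE)

# Extract the match and trim the trailing blank lines.
synopsis3 <- trimws(regmatches(page_text, m))
synopsis3
```

This assumes the paragraph is always terminated by && as described in the question; if that marker is missing, regexpr() returns -1 and regmatches() yields an empty character vector.
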
