Reputation: 739
I am attempting to scrape a National Weather Service webpage and take only a certain portion of the text out and turn it into a character object in R. It would end up being a small paragraph as shown on the NWS page. (see below)
I have been scraping the webpage with the rvest package and have tried some code with the XML package as well.
Here is my code which has the Weather Service URL included.
library(rvest)

weather_con <- read_html("http://forecast.weather.gov/product.php?site=TWC&issuedby=TWC&product=AFD&format=txt&version=1&glossary=1")
weather_con <- weather_con %>%
  html_nodes("#localcontent") %>%
  html_text()
I've also tried the XML package, fetching the page with getURL (which comes from RCurl), using this code:
library(RCurl)  # provides getURL
library(XML)    # provides htmlParse
weather_con <- getURL("http://forecast.weather.gov/product.php?site=TWC&issuedby=TWC&product=AFD&format=txt&version=1&glossary=1")
weather_con <- htmlParse(weather_con, asText = TRUE)
Both of these sets of code read in all the text from the page. I've tried other options and have attempted to find the nodes of the page to scrape certain portions of the text, but I haven't found anything useful. I have little experience with HTML so I might be missing something easy here.
All I am looking to pull out of the webpage is the SYNOPSIS paragraph. It is a small paragraph near the top of the page, and it is conveniently followed by two & symbols (&&) on a line of their own just below where the paragraph ends.
Perhaps I need something like the substr function where I can scrape that paragraph directly. However, I was hoping to find something in rvest and/or XML to do the job.
Any suggestions?
Thank you
Upvotes: 1
Views: 484
Reputation: 9313
The weather_con object already has the text you need, but it comes along with all the rest of the text.
One way to extract it is using regular expressions.
synopsis = regmatches(x = weather_con,
m = regexpr(pattern = "SYNOPSIS[^&]*",
text = weather_con))
This will capture everything from SYNOPSIS up to, but not including, the first & character.
Result:
[1] "SYNOPSIS...Strong high pressure aloft will
maintain well above\naverage temperatures today. Thursday
and Friday will see us between\nlow pressure developing
north of the area and high pressure shifting\nsouthward.
As a result, expect gusty winds and several degrees
of\ncooling. Strengthening high pressure this weekend
will again push\ntemperatures above average.\n\n"
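The match still carries the "SYNOPSIS..." label and embedded line breaks. As a minimal sketch of how it could be tidied into a single paragraph: the sample text below is made up for illustration (it is not real NWS output), and page_text, raw_synopsis and synopsis_clean are hypothetical names.

```r
# Hypothetical sample standing in for the scraped page text (not real NWS output).
page_text <- paste0(
  "000\nFXUS65 KTWC 011000\nAFDTWC\n\n",
  ".SYNOPSIS...Warm and dry today.\nBreezy on Friday.\n\n&&\n\n",
  ".DISCUSSION...Details follow.\n\n&&\n"
)

# Same pattern as above: "SYNOPSIS" followed by any run of non-& characters.
raw_synopsis <- regmatches(page_text, regexpr("SYNOPSIS[^&]*", page_text))

# Drop the "SYNOPSIS..." label, collapse line breaks and repeated
# whitespace into single spaces, then trim both ends.
synopsis_clean <- sub("^SYNOPSIS\\.*", "", raw_synopsis)
synopsis_clean <- trimws(gsub("[[:space:]]+", " ", synopsis_clean))

print(synopsis_clean)
```

With the real page you would pass weather_con (the character vector returned by html_text()) in place of page_text.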
If the synopsis itself contains an &, then you could instead capture the text up to the word DISCUSSION.
synopsis2 = regmatches(x = weather_con,
m = regexpr(pattern = "SYNOPSIS.*DISCUSSION",
text = weather_con))
The result is similar, except that it also includes the terminator and the next heading: it ends with "above average.\n\n&&\n\n.DISCUSSION".
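One caveat with "SYNOPSIS.*DISCUSSION" is that .* is greedy, so if the word DISCUSSION appears more than once on the page, the match would run to the last occurrence. A possible variant, sketched here under assumptions, uses a lazy quantifier and then strips the trailing "&&" and ".DISCUSSION". The sample text and the names page_text and synopsis_only are hypothetical; perl = TRUE with (?s) is used so that . also matches newlines.

```r
# Hypothetical sample text: note the "&" inside the synopsis and a
# second "DISCUSSION" further down the page.
page_text <- paste0(
  ".SYNOPSIS...Wind & rain tonight.\n\n&&\n\n",
  ".DISCUSSION...First part.\n\n&&\n\n",
  ".AVIATION DISCUSSION...Second part.\n"
)

# ".*?" is lazy, so the match stops at the first "DISCUSSION";
# "(?s)" lets "." match newline characters under perl = TRUE.
m <- regmatches(
  page_text,
  regexpr("(?s)SYNOPSIS.*?DISCUSSION", page_text, perl = TRUE)
)

# Remove the trailing "&&" terminator and ".DISCUSSION" heading,
# then trim the leftover whitespace.
synopsis_only <- trimws(sub("&&[[:space:]]*\\.DISCUSSION$", "", m))

print(synopsis_only)
```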
Upvotes: 2