Reputation: 744
I would like to use R to download the HTML code of any Yahoo Finance Headlines webpage, select the headlines, and collect them in a CSV file I can open in Excel. Unfortunately, I cannot find and select the HTML nodes corresponding to the headlines once I have downloaded the source file into R.
Let me show the problem with an example. I started with
source <- "http://finance.yahoo.com/q/h?s=AAPL+Headlines"
file <- "destination/finance_file.cvs"
download.file(url = source, destfile = file)
x = scan(file, what = "", sep = "\n")
producing the file finance_file.csv and, most importantly, the character vector x.
Using x, I would like to collect the headlines and write them into a column in a second CSV file, called headlines.csv.
My problem now is the following: if I select any headline, I can find it in the HTML code of the webpage itself, but I lose track of it in x. Therefore, I do not know how to extract it.
For the extraction I was thinking of
x = x[grep("some string of characters to do the job", x)]
but I am no expert in web scraping.
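For example (the pattern here is purely hypothetical, just to illustrate the shape of the call), I imagine something like:
# hypothetical pattern: keep only the lines of x that contain a link tag
headline_lines <- grep("<a href=", x, value = TRUE, fixed = TRUE)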
Any ideas/suggestions? Thank you very much!
Upvotes: 1
Views: 1359
Reputation: 32401
You can use the XML package and write the XPath query needed to extract the headlines.
Since the web page looks like:
...
<ul class="newsheadlines"/>
<ul>
<li><a href="...">First headline</a></li>
...
you get the following query.
library(XML)
source <- "http://finance.yahoo.com/q/h?s=AAPL+Headlines"
# htmlParse() can fetch and parse the page straight from the URL
d <- htmlParse(source)
# grab the link text in the <ul> that follows the "newsheadlines" marker
headlines <- xpathSApply(d, "//ul[contains(@class,'newsheadlines')]/following::ul/li/a", xmlValue)
# release the memory held by the parsed document
free(d)
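To get from there to the headlines.csv file mentioned in the question, the extracted vector can then be written out with base R's write.csv (a minimal sketch; adjust the path as needed):
# write the headlines as a single column of headlines.csv
write.csv(data.frame(headline = headlines), "headlines.csv", row.names = FALSE)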
Upvotes: 1