Reputation: 744
I would like to use R to download the HTML code of any Yahoo Finance Headlines webpage, select the headlines, and collect them in a CSV file I can open in Excel. Unfortunately, I cannot find and select the HTML nodes corresponding to the headlines once I have downloaded the source file into R.
Let me show the problem with an example. I started with
source <- "http://finance.yahoo.com/q/h?s=AAPL+Headlines"
file <- "destination/finance_file.cvs"
download.file(url = source, destfile = file)
x = scan(file, what = "", sep = "\n")
producing the file finance_file.csv and, most importantly, the character vector x.
Using x, I would like to collect the headlines and write them into a column in a second CSV file, called headlines.csv.
My problem now is the following: if I select any headline, I can find it in the HTML code of the webpage itself, but I lose track of it in x. Therefore, I do not know how to extract it.
For the extraction I was thinking of
x = x[grep("some string of characters to do the job", x)]
but I am no expert in web scraping.
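For example (the pattern here is purely hypothetical, just to illustrate the shape of the call), I imagine something like:
# hypothetical pattern: keep only the lines of x that contain a link tag
headline_lines <- grep("<a href=", x, value = TRUE, fixed = TRUE)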
Any ideas/suggestions? Thank you very much!
Upvotes: 1
Views: 1359
Reputation: 32401
You can use the XML package and write the XPath query needed to extract the headlines.
Since the web page looks like:
...
<ul class="newsheadlines"/>
<ul>
<li><a href="...">First headline</a></li>
...
you get the following query.
library(XML)
source <- "http://finance.yahoo.com/q/h?s=AAPL+Headlines"
# htmlParse() can fetch and parse the page straight from the URL
d <- htmlParse(source)
# grab the link text in the <ul> that follows the "newsheadlines" marker
headlines <- xpathSApply(d, "//ul[contains(@class,'newsheadlines')]/following::ul/li/a", xmlValue)
# release the memory held by the parsed document
free(d)
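To get from there to the headlines.csv file mentioned in the question, the extracted vector can then be written out with base R's write.csv (a minimal sketch; adjust the path as needed):
# write the headlines as a single column of headlines.csv
write.csv(data.frame(headline = headlines), "headlines.csv", row.names = FALSE)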
Upvotes: 1