Avitus

Reputation: 744

Yahoo Finance Headlines webpage scraping with R

I would like to use R to download the HTML source of any Yahoo Finance Headlines page, extract the headlines, and collect them in an Excel file. Unfortunately, once I download the source into R, I cannot find and select the HTML nodes corresponding to the headlines.

Let me show the problem with an example. I started with

source <- "http://finance.yahoo.com/q/h?s=AAPL+Headlines"
file <- "destination/finance_file.csv"
download.file(url = source, destfile = file)
x <- scan(file, what = "", sep = "\n")

producing the Excel file finance_file.csv and, most importantly, the character vector x.

Using x I would like to collect the headlines and write them into a column in a second Excel file, called headlines.csv.

My problem is the following: I can find any given headline in the HTML source of the web page itself, but I lose track of it in x, so I do not know how to extract it.

For the extraction I was thinking of

x = x[grep("some string of characters to do the job", x)]

but I am no expert in web scraping. Any ideas/suggestions?

Thank you very much!

Upvotes: 1

Views: 1359

Answers (1)

Vincent Zoonekynd

Reputation: 32401

You can use the XML package and write the XPath query needed to extract the headlines.

Since the web page looks like:

...
<ul class="newsheadlines"/>
<ul>
  <li><a href="...">First headline</a></li>
  ...

you get the following query.

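library(XML)
# Named `url` rather than `source` to avoid masking base::source()
url <- "http://finance.yahoo.com/q/h?s=AAPL+Headlines"
d <- htmlParse(url)  # download and parse the page
# Text of every link in a <ul> that follows the "newsheadlines" <ul>
headlines <- xpathSApply(d, "//ul[contains(@class,'newsheadlines')]/following::ul/li/a", xmlValue)
free(d)  # release the parsed document
headlines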
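
To get the headlines into a file that Excel can open, the result of the query can be written out as a one-column CSV. The following is only a minimal sketch: it parses a small hypothetical HTML fragment that mirrors the structure shown above instead of the live page (whose layout may differ), and the names html, out_file and the column name headline are assumptions made for illustration.

```r
library(XML)

# Hypothetical fragment with the same structure as the page described above
html <- '<html><body>
<ul class="newsheadlines"></ul>
<ul>
  <li><a href="#">First headline</a></li>
  <li><a href="#">Second headline</a></li>
</ul>
</body></html>'

# Parse the string as HTML (asText = TRUE: the input is markup, not a file or URL)
d <- htmlParse(html, asText = TRUE)

# Same XPath query as above: link text inside any <ul> after the marker <ul>
headlines <- xpathSApply(
  d, "//ul[contains(@class,'newsheadlines')]/following::ul/li/a", xmlValue
)
free(d)

# Write the headlines as a single column; the path is an assumption
out_file <- file.path(tempdir(), "headlines.csv")
write.csv(data.frame(headline = headlines), out_file, row.names = FALSE)
```

Replacing the html string with the page URL (and dropping asText = TRUE) should apply the same steps to the real page, provided its markup still matches the structure above.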

Upvotes: 1
