maumag77

Reputation: 55

Scraping headlines and dates from Yahoo Finance using R

I'm trying to scrape news from the Yahoo Finance web page with R to build a table with two columns: date and headline. Following the instructions from here, I managed to create the headline column; the next step is to get the dates and add them as a second column.

I guess I just need to modify this command:

out_dt <- xpathSApply(d, "//ul[contains(@class,'newsheadlines')]/following::ul/li/a", xmlValue)

so that it returns the date instead of the headline from, for example, this page source:

<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"><title>BMPS.MI Headlines | BANCA MPS Stock - Yahoo! Finance</title><script type="text/javascript" src="http://l.yimg.com/a/i/us/fi/03rd/yg_csstare_nobgcolor.js"></script><link rel="stylesheet" href="http://l.yimg.com/zz/combo?kx/yucs/uh3/uh/1138/css/uh_non_mail-min.css&amp;kx/yucs/uh3s/atomic/84/css/atomic-min.css&amp;kx/yucs/uh_common/meta/3/css/meta-min.css&amp;kx/yucs/uh3/top-bar/366/css/no_icons-min.css&amp;kx/yucs/uh3/search/css/588/blue_border-min.css&amp;kx/yucs/uh3/get-the-app/151/css/get_the_app-min.css&amp;bm/lib/fi/common/p/d/static/css/2.0.356981/2.0.0/mini/yfi_yoda_legacy_lego_concat.css&amp;bm/lib/fi/common/p/d/static/css/2.0.356981/2.0.0/mini/yfi_symbol_suggest.css&amp;bm/lib/fi/common/p/d/static/css/2.0.356981/2.0.0/mini/yui_helper.css&amp;bm/lib/fi/common/p/d/static/css/2.0.356981/2.0.0/mini/yfi_theme_teal.css&amp;bm/lib/fi/common/p/d/static/css/2.0.356981/2.0.0/mini/yfi_follow_quote.css&amp;bm/lib/fi/common/p/d/static/css/2.0.356981/2.0.0/mini/yfi_follow_stencil.css" type="text/css"><script language="javascript">
ll_js = new Array();
</script><script type="text/javascript" src="http://l1.yimg.com/bm/combo?fi/common/p/d/static/js/2.0.356981/2.0.0/mini/yui-min-3.9.1.js&amp;fi/common/p/d/static/js/2.0.356981/yui_2.8.0/build/yuiloader-dom-event/2.0.0/mini/yuiloader-dom-event.js&amp;fi/common/p/d/static/js/2.0.356981/yui_2.8.0/build/container/2.0.0/mini/container.js&amp;fi/common/p/d/static/js/2.0.356981/yui_2.8.0/build/datasource/2.0.0/mini/datasource.js&amp;fi/common/p/d/static/js/2.0.356981/yui_2.8.0/build/autocomplete/2.0.0/mini/autocomplete.js"></script><script language="javascript">
YUI.YUICfg = {"base":"http:\/\/l.yimg.com\/","comboBase":"http:\/\/l.yimg.com\/zz\/combo?","combine":true,"allowRollup":true,"maxURLLength":"2000"}
YUI.YUICfg.root = 'yui:'+YUI.version+'/build/';
YUI.applyConfig(YUI.YUICfg); 
</script><script language="javascript">
ll_js.push({
    'success_callback' : function() {
            YUI().use('stencil', 'follow-quote', 'node', function (Y) {
                var conf = {'xhrBase': '/', 'lang': 'en-US', 'region': 'US', 'loginUrl': 'https://login.yahoo.com/config/login_verify2?&.done=http://finance.yahoo.com/q?s=BMPS.MI&.intl=us'};

                Y.Media.FollowQuote.init(conf, function () {
                    var exchNode = null,
                        followSecClass = "",
                        followHtml = "",
                        followNode = null;

                    followSecClass = Y.Media.FollowQuote.getFollowSectionClass();
                    followHtml = Y.Media.FollowQuote.getFollowBtnHTML({ ticker: 'BMPS.MI', addl_classes: "follow-quote-always-visible", showFollowText: true });
                    followNode = Y.Node.create(followHtml);
                    exchNode = Y.one(".wl_sign");
                    if (!Y.Lang.isNull(exchNode)) { 
                        exchNode.append(followNode);

                    }

                });
            });
    }
});

Any suggestions?

Upvotes: 2

Views: 1593

Answers (1)

Rentrop

Reputation: 21497

You can use rvest as follows:

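library(rvest)
# Fetch the headlines page and select every list item in the news tab
doc <- read_html("http://finance.yahoo.com/q/h?s=AAPL+Headlines")
scope <- doc %>% html_nodes("#yfncsumtab li")
# Build a one-row data frame (date, headline) for each list item
res <- lapply(scope, function(li) {
  data.frame(
    date = li %>% html_node("cite span") %>% html_text(),
    headline = li %>% html_node("a") %>% html_text(),
    stringsAsFactors = FALSE
  )
})
# Stack the rows into a single data frame
do.call(rbind, res)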

This gives you:

                date                                                                                  headline
1   (Tue 3:49AM EDT)                                   US hacks iPhone, ends legal battle but questions linger
2   (Tue 1:27AM EDT)                           Amazon Echo turns into a sleeper hit, offsetting Fire's failure
3   (Tue 1:00AM EDT)                                       Why Everyone Loses in Apple’s Fight Against the FBI
4  (Tue 12:36AM EDT) [$$] US drops Apple case, Japan's negative rate bounty and the criminals paid not to kill
5  (Tue 12:25AM EDT)                              U.S. succeeds in cracking Apple's iPhone, drops legal action
6  (Tue 12:00AM EDT)  [$$] Brussels Attacks: Belgium Turns to U.S. for Help in Scouring Seized Laptops, Phones
7      (Mon, Mar 28)                [$$] FBI Opens San Bernardino Shooter’s iPhone; U.S. Drops Demand on Apple
8      (Mon, Mar 28)                                              Wolverton: Encyption debate isn't going away
9      (Mon, Mar 28)                                            [$$] US drops Apple case after cracking iPhone
10     (Mon, Mar 28)         Words of warning — not celebration — in Silicon Valley after FBI ends Apple fight
11     (Mon, Mar 28)                               [$$] FBI Opens Shooter's iPhone; U.S. Drops Demand on Apple
12     (Mon, Mar 28)                                           FBI hacks into terrorist’s iPhone without Apple
13     (Mon, Mar 28)                                  Justice Department cracks iPhone; withdraws legal action
14     (Mon, Mar 28)                                Apple responds: 'This case should have never been brought'
15     (Mon, Mar 28)                           IPhone Security Is the Casualty in Apple's Victory Over the FBI
16     (Mon, Mar 28)                           Cracked Apple iPhone By F.B.I. Puts Spotlight On Apple Security
17     (Mon, Mar 28)                                    DOJ Drops Apple Case: Bloomberg West (Full Show 03/28)
18     (Mon, Mar 28)                                          Apple, Inc.'s New iPhone SE: Off to a Big Start?
19     (Mon, Mar 28)                                               AP Explains: Apple vs. FBI _ What Happened?
20     (Mon, Mar 28)                                                  PRESS DIGEST- Financial Times - March 29

I'll leave the date parsing to you.
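As a starting point, here is a minimal sketch for the "(Mon, Mar 28)" style entries. The helper name parse_headline_date and its year argument are my own assumptions: the page does not print the year, so you have to supply it. Entries that only carry a time, such as "(Tue 3:49AM EDT)", come back as NA and need separate handling.

```r
# Hypothetical helper (not part of rvest): converts strings such as
# "(Mon, Mar 28)" to Date. The year is an assumption supplied by the caller.
parse_headline_date <- function(x, year) {
  x <- gsub("[()]", "", trimws(x))                 # drop the parentheses
  pattern <- "^[A-Za-z]{3}, ([A-Za-z]{3}) ([0-9]{1,2})$"
  ok <- grepl(pattern, x)                          # TRUE for "Mon, Mar 28" style
  # month.abb is locale-independent, unlike the %b format of as.Date()
  month <- match(sub(pattern, "\\1", x), month.abb)
  day <- suppressWarnings(as.integer(sub(pattern, "\\2", x)))
  out <- as.Date(rep(NA_character_, length(x)))    # NA where the pattern fails
  out[ok] <- as.Date(sprintf("%d-%02d-%02d", year, month[ok], day[ok]))
  out
}

parsed <- parse_headline_date(c("(Mon, Mar 28)", "(Tue 3:49AM EDT)"), year = 2016)
```

Matching the month against month.abb rather than using %b keeps the result independent of the locale R is running in.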

An alternative is to take the date from the h3 heading instead:

library(rvest)
doc <- read_html("http://finance.yahoo.com/q/h?s=AAPL+Headlines")
scope <- doc %>% html_nodes("#yfncsumtab")
# One date string per h3 heading
dates <- scope %>% html_nodes("h3 span") %>% html_text()
# One character vector of headlines per list that directly follows an h3
headlines <- scope %>% html_nodes("h3 + ul") %>% lapply(. %>% html_nodes("li a") %>% html_text)

# Combine both: repeat each date for every headline listed under it
do.call(rbind, Map(cbind, dates, headlines))

This results in the following matrix:

      [,1]                      [,2]                                                                                       
 [1,] "Tuesday, March 29, 2016" "March 29 Premarket Briefing: 10 Things You Should Know"                                   
 [2,] "Tuesday, March 29, 2016" "You might soon be able to pay for goods in-store using Facebook Messenger"                
 [3,] "Tuesday, March 29, 2016" "FBI unlocks iPhone"                                                                       
 [4,] "Tuesday, March 29, 2016" "US hacks iPhone, ends legal battle but questions linger"                                  
 [5,] "Tuesday, March 29, 2016" "Amazon Echo turns into a sleeper hit, offsetting Fire's failure"                          
 [6,] "Tuesday, March 29, 2016" "Why Everyone Loses in Apple’s Fight Against the FBI"                                      
 [7,] "Tuesday, March 29, 2016" "[$$] US drops Apple case, Japan's negative rate bounty and the criminals paid not to kill"
 [8,] "Tuesday, March 29, 2016" "U.S. succeeds in cracking Apple's iPhone, drops legal action"                             
 [9,] "Tuesday, March 29, 2016" "[$$] Brussels Attacks: Belgium Turns to U.S. for Help in Scouring Seized Laptops, Phones" 
[10,] "Monday, March 28, 2016"  "[$$] FBI Opens San Bernardino Shooter’s iPhone; U.S. Drops Demand on Apple"               
[11,] "Monday, March 28, 2016"  "Wolverton: Encyption debate isn't going away"                                             
[12,] "Monday, March 28, 2016"  "[$$] US drops Apple case after cracking iPhone"                                           
[13,] "Monday, March 28, 2016"  "Words of warning — not celebration — in Silicon Valley after FBI ends Apple fight"        
[14,] "Monday, March 28, 2016"  "[$$] FBI Opens Shooter's iPhone; U.S. Drops Demand on Apple"                              
[15,] "Monday, March 28, 2016"  "FBI hacks into terrorist’s iPhone without Apple"                                          
[16,] "Monday, March 28, 2016"  "Justice Department cracks iPhone; withdraws legal action"                                 
[17,] "Monday, March 28, 2016"  "Apple responds: 'This case should have never been brought'"                               
[18,] "Monday, March 28, 2016"  "IPhone Security Is the Casualty in Apple's Victory Over the FBI"                          
[19,] "Monday, March 28, 2016"  "Cracked Apple iPhone By F.B.I. Puts Spotlight On Apple Security"                          
[20,] "Monday, March 28, 2016"  "DOJ Drops Apple Case: Bloomberg West (Full Show 03/28)"  

In the second case, too, I'll leave the date parsing to you.
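For the heading format the parsing is more direct, because the full date including the year is present. A rough sketch, assuming dates and headlines have the shape produced by the rvest calls above (the sample values below are stand-ins so the snippet runs without hitting the page):

```r
# Stand-in values; in practice these come from the scrape shown earlier.
dates <- c("Tuesday, March 29, 2016", "Monday, March 28, 2016")
headlines <- list(
  c("FBI unlocks iPhone",
    "US hacks iPhone, ends legal battle but questions linger"),
  c("[$$] US drops Apple case after cracking iPhone")
)

# %A and %B are matched against the current locale's day and month names,
# so switch to the C locale to make sure the English names are recognised.
invisible(Sys.setlocale("LC_TIME", "C"))

news <- data.frame(
  # repeat each parsed date once per headline listed under it
  date = rep(as.Date(dates, format = "%A, %B %d, %Y"), lengths(headlines)),
  headline = unlist(headlines),
  stringsAsFactors = FALSE
)
```

This yields a data frame with a proper Date column rather than a character matrix, which makes later sorting or filtering by date easier.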

Upvotes: 3
