Frank
Frank

Reputation: 111

How to do web scraping of key stats from FinViz tables per stock?

Is anyone experienced in scraping data with R?

I would like to extract the corresponding data for given stocks. I did this in the following way:

library(XML)

stocks <- c("AAPL","MSFT")

for (s in stocks) {
  url <- paste0("http://finviz.com/quote.ashx?t=", s)
  webpage <- readLines(url)
  html <- htmlTreeParse(webpage, useInternalNodes = TRUE, asText = TRUE)
  tableNodes <- getNodeSet(html, "//table")

  # ASSIGN TO STOCK NAMED DFS
  assign(s, readHTMLTable(tableNodes[[9]], 
                          header= c("data1", "data2", "data3", "data4", "data5", "data6",
                                    "data7", "data8", "data9", "data10", "data11", "data12")))

  # ADD COLUMN TO IDENTIFY STOCK 
  df <- get(s)
  df['stock'] <- s
  assign(s, df)
}

# COMBINE ALL STOCK DATA 
stockdatalist <- cbind(mget(stocks))
stockdata <- do.call(rbind, stockdatalist)
# MOVE STOCK ID TO FIRST COLUMN
stockdata <- stockdata[, c(ncol(stockdata), 1:ncol(stockdata)-1)]

The problem, however, is that I have obtained it in the wrong format:

  stock      data1       data2       data3 data4       data5  data6         data7  data8        data9          data10       data11 data12
1  AAPL      Index DJIA S&P500         P/E 16.13   EPS (ttm)  10.22   Insider Own  0.06% Shs Outstand           5.09B    Perf Week -7.35%
2  AAPL Market Cap     839.87B Forward P/E 12.50  EPS next Y  13.20 Insider Trans -7.80%    Shs Float           5.07B   Perf Month -4.38%
3  AAPL     Income      53.13B         PEG  1.38  EPS next Q   2.71      Inst Own 63.20%  Short Float           1.16% Perf Quarter -5.40%
4  AAPL      Sales     239.18B         P/S  3.51  EPS this Y 10.80%    Inst Trans  0.98%  Short Ratio            1.60  Perf Half Y  7.53%
5  AAPL    Book/sh       27.42         P/B  6.02  EPS next Y 14.97%           ROA 13.80% Target Price          192.54    Perf Year 17.05%
6  AAPL    Cash/sh       15.15         P/C 10.89 EPS next 5Y 11.68%           ROE 37.40%    52W Range 138.62 - 183.50     Perf YTD -2.54%

What I would like to do is that a certain stock only appears once in the row name and that the datanames are then displayed as column names where the columns then contain the corresponding numbers...

Upvotes: 0

Views: 1150

Answers (1)

mysteRious
mysteRious

Reputation: 4294

This is not super elegant but it gets the job done. It relies on the knowledge that the data you're interested in is in the 6th table on this page:

library(rvest)
url <- read_html("http://finviz.com/quote.ashx?t=AAPL")
tables <- html_nodes(url,"table")
scraped <- tables %>% html_nodes("table") %>% .[6] %>% 
    html_table(fill=TRUE) %>% data.frame()

This yields data that looks like:

> str(scraped)
'data.frame':   12 obs. of  12 variables:
 $ X1 : chr  "Index" "Market Cap" "Income" "Sales" ...
 $ X2 : chr  "DJIA S&P500" "839.87B" "53.13B" "239.18B" ...
 $ X3 : chr  "P/E" "Forward P/E" "PEG" "P/S" ...
 $ X4 : chr  "16.13" "12.50" "1.38" "3.51" ...
... the rest is truncated

You can get the format you want using stack:

new_df <- as.data.frame(scraped[, seq(from=2, to=ncol(scraped), by=2)])
new_labels <- as.data.frame(scraped[, seq(from=1, to=ncol(scraped), by=2)])
results_df <- as.data.frame(cbind(stack(new_labels),stack(new_df))[,c(1,3)])
names(results_df) <- c("stock","AAPL")

This provides a results_df that looks like this, which contains all 72 variable/value pairs:

> head(results_df)
    Variable       Value
1      Index DJIA S&P500
2 Market Cap     839.87B
3     Income      53.13B
4      Sales     239.18B
5    Book/sh       27.42
6    Cash/sh       15.15

Then you can transpose it and add stock labels with as.data.frame(t(rbind(c("stock","AAPL"),results_df))) .

Upvotes: 1

Related Questions