Reputation: 111
Is anyone experienced in scraping data with R?
I would like to extract the corresponding data for given stocks. I did this in the following way:
library(XML)
stocks <- c("AAPL","MSFT")
for (s in stocks) {
url <- paste0("http://finviz.com/quote.ashx?t=", s)
webpage <- readLines(url)
html <- htmlTreeParse(webpage, useInternalNodes = TRUE, asText = TRUE)
tableNodes <- getNodeSet(html, "//table")
# ASSIGN TO STOCK NAMED DFS
assign(s, readHTMLTable(tableNodes[[9]],
header= c("data1", "data2", "data3", "data4", "data5", "data6",
"data7", "data8", "data9", "data10", "data11", "data12")))
# ADD COLUMN TO IDENTIFY STOCK
df <- get(s)
df['stock'] <- s
assign(s, df)
}
# COMBINE ALL STOCK DATA
stockdatalist <- cbind(mget(stocks))
stockdata <- do.call(rbind, stockdatalist)
# MOVE STOCK ID TO FIRST COLUMN
stockdata <- stockdata[, c(ncol(stockdata), 1:ncol(stockdata)-1)]
The problem, however, is that I have obtained it in the wrong format:
stock data1 data2 data3 data4 data5 data6 data7 data8 data9 data10 data11 data12
1 AAPL Index DJIA S&P500 P/E 16.13 EPS (ttm) 10.22 Insider Own 0.06% Shs Outstand 5.09B Perf Week -7.35%
2 AAPL Market Cap 839.87B Forward P/E 12.50 EPS next Y 13.20 Insider Trans -7.80% Shs Float 5.07B Perf Month -4.38%
3 AAPL Income 53.13B PEG 1.38 EPS next Q 2.71 Inst Own 63.20% Short Float 1.16% Perf Quarter -5.40%
4 AAPL Sales 239.18B P/S 3.51 EPS this Y 10.80% Inst Trans 0.98% Short Ratio 1.60 Perf Half Y 7.53%
5 AAPL Book/sh 27.42 P/B 6.02 EPS next Y 14.97% ROA 13.80% Target Price 192.54 Perf Year 17.05%
6 AAPL Cash/sh 15.15 P/C 10.89 EPS next 5Y 11.68% ROE 37.40% 52W Range 138.62 - 183.50 Perf YTD -2.54%
What I would like to do is that a certain stock only appears once in the row name and that the datanames are then displayed as column names where the columns then contain the corresponding numbers...
Upvotes: 0
Views: 1150
Reputation: 4294
This is not super elegant but it gets the job done. It relies on the knowledge that the data you're interested in is in the 6th table
on this page:
library(rvest)
url <- read_html("http://finviz.com/quote.ashx?t=AAPL")
tables <- html_nodes(url,"table")
scraped <- tables %>% html_nodes("table") %>% .[6] %>%
html_table(fill=TRUE) %>% data.frame()
This yields data that looks like:
> str(scraped)
'data.frame': 12 obs. of 12 variables:
$ X1 : chr "Index" "Market Cap" "Income" "Sales" ...
$ X2 : chr "DJIA S&P500" "839.87B" "53.13B" "239.18B" ...
$ X3 : chr "P/E" "Forward P/E" "PEG" "P/S" ...
$ X4 : chr "16.13" "12.50" "1.38" "3.51" ...
... the rest is truncated
You can get the format you want using stack
:
new_df <- as.data.frame(scraped[, seq(from=2, to=ncol(scraped), by=2)])
new_labels <- as.data.frame(scraped[, seq(from=1, to=ncol(scraped), by=2)])
results_df <- as.data.frame(cbind(stack(new_labels),stack(new_df))[,c(1,3)])
names(results_df) <- c("stock","AAPL")
This provides a results_df
that looks like this, which contains all 72 variable/value pairs:
> head(results_df)
Variable Value
1 Index DJIA S&P500
2 Market Cap 839.87B
3 Income 53.13B
4 Sales 239.18B
5 Book/sh 27.42
6 Cash/sh 15.15
Then you can transpose it and add stock labels with as.data.frame(t(rbind(c("stock","AAPL"),results_df)))
.
Upvotes: 1