mks212
mks212

Reputation: 921

Trouble Populating a data frame & Renaming Columns When Moving The Data To Another Data Frame

I am running a script that downloads stock prices from Yahoo Finance and then calculates the log of their price changes, the standard deviation for the prior 20 days, multiplies the ith price by the ith standard deviation and then divides the current price change by the price change of the prior day or i-1.

I have for loops everywhere which is the first red flag since there are better ways to do things in R. I am more than happy to use vectorized operations wherever possible.

I am defining an empty data frame in R to contain the information I want. I am reading that this is a bad idea. I do know how big the data frame will be since I know the number of symbols for which I am downloading prices. However, when I initialize the data frame, it simply populates the number of rows specified with 0s and adds the data I want beneath. It is easy to simply fix this by deleting the rows after the fact, but I am sure there is a cleaner way.

The below code creates an error:

 1: In `[<-.factor`(`*tmp*`, ri, value = "IBM") :
invalid factor level, NA generated.  

If I change the 0s to 1s when initializing SpikeData, it will populate just fine but then I need to remove the blank first row which is done in the 3rd to last line of the script.

The second issue I have is with the column names in the SpikeRank data frame. I want the first column to be titled "Symbols" and the next 5 to be dates. However, I do not seem to be able to mix and match. If I set the first col name to "Symbol," the dates are converted to a serial number. Further down the road with this script, I will be adding other columns that also have characters, so I'd like to be able to mix and match the dates and characters in the column names.

GetVol <- function(Window = 20, end=as.Date(Sys.time()), start = end-37){

library(tseries)
library(zoo)

SpikeRank = data.frame(stringsAsFactors=FALSE, Symbol=character(0),
SDInPriceTerms1=numeric(0), SDInPriceTerms2=numeric(0),
SDInPriceTerms3=numeric(0), SDInPriceTerms4=numeric(0), 
SDInPriceTerms5=numeric(0))

SymbolList <- c("AAPL", "IBM")

for (Symbol in SymbolList)
 {
ts <- get.hist.quote(instrument=Symbol, 
                   start, end, 
                   quote="Close", provider="yahoo", origin="1970-01-01",
                   compression="d", retclass="zoo")

df <- data.frame(ts)
df <- data.frame(Date=as.Date(rownames(df)), Close=df$Close)


df$PriorClose <- c(NA, head(df$Close, -1))
df$Return <- log(df$Close/df$PriorClose)

for (i in 1:length(df$Close))
{
  if(i < Window+1)  
  {
    df$stddev[i] = NA
    df$SDPrice[i] = NA
    df$CurrentSpike[i] = NA
  }

  else
  { 
    df$stddev[i] <- sd(df$Return[(i-Window+1):i], na.rm = TRUE)
    df$SDPrice[i] <- df$stddev[i] * df$Close[i]
    df$CurrentSpike[i] <- (df$Close[i] - df$PriorClose[i])/df$SDPrice[i-1]
  }

}#end for

df <- na.omit(df)
SpikeRank <-  rbind(SpikeRank, c(Symbol, df$CurrentSpike))

}#end for loop

SpikeRank <- SpikeRank[-1,]
colnames(SpikeRank) <- c(as.Date(1), df$Date)
print(SpikeRank)

)

The print of SpikeRank assuming that the data frame is initialized with 1s instead of 0s is below. The values in the rows are all correct. I am not having any issue with the calculations.

1970-01-02         2014-03-03        2014-03-04        2014-03-05         2014-03-06
2       AAPL  0.268505943103897 0.613087867831883 0.195414096323545 -0.289567687725125
3        IBM -0.600147413085885  1.43686920161242 0.592564096496001  0.426680866502713
      2014-03-07
2 -0.055981998552076
3 0.0344269384370513

Here is df for IBM in case that is helpful:

     Date  Close PriorClose        Return      stddev  SDPrice CurrentSpike
1  2014-01-30 177.36         NA            NA          NA       NA           NA
2  2014-01-31 176.68     177.36 -0.0038413786          NA       NA           NA
3  2014-02-03 172.90     176.68 -0.0216267940          NA       NA           NA
4  2014-02-04 172.84     172.90 -0.0003470816          NA       NA           NA
5  2014-02-05 174.24     172.84  0.0080673481          NA       NA           NA
6  2014-02-06 174.67     174.24  0.0024648203          NA       NA           NA
7  2014-02-07 177.25     174.67  0.0146626860          NA       NA           NA
8  2014-02-10 177.14     177.25 -0.0006207850          NA       NA           NA
9  2014-02-11 179.70     177.14  0.0143484134          NA       NA           NA
10 2014-02-12 180.24     179.70  0.0030005023          NA       NA           NA
11 2014-02-13 181.84     180.24  0.0088378834          NA       NA           NA
12 2014-02-14 183.69     181.84  0.0101223746          NA       NA           NA
13 2014-02-18 183.19     183.69 -0.0027256886          NA       NA           NA
14 2014-02-19 182.95     183.19 -0.0013109741          NA       NA           NA
15 2014-02-20 184.26     182.95  0.0071349122          NA       NA           NA
16 2014-02-21 182.79     184.26 -0.0080098508          NA       NA           NA
17 2014-02-24 183.45     182.79  0.0036041979          NA       NA           NA
18 2014-02-25 183.23     183.45 -0.0011999565          NA       NA           NA
19 2014-02-26 184.06     183.23  0.0045195971          NA       NA           NA
20 2014-02-27 185.27     184.06  0.0065524292          NA       NA           NA
21 2014-02-28 185.17     185.27 -0.0005398985 0.008188660 1.516294           NA
22 2014-03-03 184.26     185.17 -0.0049265184 0.008233949 1.517188  -0.60014741
23 2014-03-04 186.44     184.26  0.0117616678 0.006336123 1.181307   1.43686920
24 2014-03-05 187.14     186.44  0.0037475283 0.006261815 1.171836   0.59256410
25 2014-03-06 187.64     187.14  0.0026682336 0.006192074 1.161881   0.42668087
26 2014-03-07 187.68     187.64  0.0002131514 0.006236115 1.170394   0.03442694

Thank you for your help.

Upvotes: 1

Views: 185

Answers (1)

jlhoward
jlhoward

Reputation: 59395

First of all, your code does not run as presented. This is because you don't bother to define start, end, or Window. So I had to infer those values from your df for IBM. (BTW: This is probably why no one else has bothered to respond. The expectation on SO is that, if you want help, you will provide a working example).

So this is a shorter way to produce what you have. Note the use of rollapply(...) and head(...) to avoid the inner loop, and the use of do.call(rbind,lapply(...)) to avoid the outer loop and the pre-allocation of SpikeRank. Your insistence on having dates as column names created a lot of problems, because default behavior in most functions that create data frames is to avoid column names that begin with a number.

library(tseries)   # for get.hist.quote
library(zoo)       # for rollapply

start      <- "2014-01-30"
end        <- "2014-03-07"
Window     <- 20
SymbolList <- c("AAPL","IBM")

get.SpikeRank <- function(Symbol,start, end, Window) {
  ts <- get.hist.quote(instrument=Symbol, 
                       start, end, 
                       quote="Close", provider="yahoo", origin="1970-01-01",
                       compression="d", retclass="zoo")
  df <- data.frame(ts)
  df <- data.frame(Date=as.Date(rownames(df)), Close=df$Close)
  df$PriorClose   <- c(NA, head(df$Close, -1))
  df$Return       <- log(df$Close/df$PriorClose)
  df$stdev        <- c(rep(NA,Window),rollapply(df$Return[-1],width=Window,sd,na.rm=T))
  df$SDPrice      <- df$stdev * df$Close
  df$CurrentSpike <- (df$Close - df$PriorClose)/c(NA,head(df$SDPrice,-1))
  df              <- na.omit(df)
  row             <- df$CurrentSpike
  names(row)      <- df$Date
  return(row)
}
SpikeRank <- do.call(rbind,lapply(SymbolList,get.SpikeRank,start,end,Window))
SpikeRank <- data.frame(Symbol=SymbolList, SpikeRank)
colnames(SpikeRank)[-1] <- substring(colnames(SpikeRank)[-1],2)
print(SpikeRank)
#   Symbol 2014.03.03 2014.03.04 2014.03.05 2014.03.06  2014.03.07
# 1   AAPL  0.2685059  0.6130879  0.1954141 -0.2895677 -0.05598200
# 2    IBM -0.6001474  1.4368692  0.5925641  0.4266809  0.03442694

Upvotes: 1

Related Questions