mql4beginner

Reputation: 2233

glm function causes a strange change in data frame

I'm working with an IBM data set loaded through quantmod. I created two variables and then used the glm function to model the relationship between them. The code ran without errors, but I then noticed that parts of the resulting data frame contain NAs. How can I overcome this issue? Here is my code:

library("quantmod")
getSymbols("IBM")
dim(IBM)
IBM$CurrtDay_up <- ifelse(IBM$IBM.Open < IBM$IBM.Close,1,0)
IBM$LastDay_green <- ifelse((lag(IBM$IBM.Open,k=1) < lag(IBM$IBM.Close,k=1)),1,0)
head(IBM)
           IBM.Open IBM.High IBM.Low IBM.Close IBM.Volume IBM.Adjusted CurrtDay_up LastDay_green
2007-01-03    97.18    98.40   96.26     97.27    9196800     82.78498           1            NA
2007-01-04    97.25    98.79   96.88     98.31   10524500     83.67011           1             1
2007-01-05    97.60    97.95   96.91     97.42    7221300     82.91264           0             1
2007-01-08    98.50    99.50   98.35     98.90   10340000     84.17225           1             0
2007-01-09    99.08   100.33   99.07    100.07   11108200     85.16802           1             1
2007-01-10    98.50    99.05   97.93     98.89    8744800     84.16374           1             1

Then I added the glm function:

IBM_1 <- IBM[3:1000,] # to avoid the first row's NA.
glm_greenDay <- glm(CurrtDay_up~LastDay_green,data=IBM_1,family=binomial(link='logit'))
IBM_1$glm_pred<-predict(glm_greenDay,newdata=IBM_1,type='response')
head(IBM_1)
           IBM.Open IBM.High IBM.Low IBM.Close IBM.Volume IBM.Adjusted CurrtDay_up LastDay_green  glm_pred
2007-01-04       NA       NA      NA        NA         NA           NA          NA            NA 0.5683453
2007-01-05    97.60    97.95   96.91     97.42    7221300     82.91264           0             1        NA
2007-01-07       NA       NA      NA        NA         NA           NA          NA            NA 0.5407240
2007-01-08    98.50    99.50   98.35     98.90   10340000     84.17225           1             0        NA
2007-01-08       NA       NA      NA        NA         NA           NA          NA            NA 0.5683453
2007-01-09    99.08   100.33   99.07    100.07   11108200     85.16802           1             1        NA

UPDATED CODE (please note that one row, row 2, has been duplicated):

 IBM_1<-IBM[complete.cases(IBM[1:1000,]),] # to avoid the first row's NA.
 glm_greenDay<-glm(CurrtDay_up~LastDay_green,data=IBM_1,family=binomial(link='logit'))
 IBM_1$glm_pred<-glm_greenDay$fitted.values
 head(IBM_1)
           IBM.Open IBM.High IBM.Low IBM.Close IBM.Volume IBM.Adjusted CurrtDay_up LastDay_green  glm_pred
2007-01-03       NA       NA      NA        NA         NA           NA          NA            NA 0.5691203
2007-01-04    97.25    98.79   96.88     98.31   10524500     83.67011           1             1        NA
2007-01-04       NA       NA      NA        NA         NA           NA          NA            NA 0.5691203
2007-01-05    97.60    97.95   96.91     97.42    7221300     82.91264           0             1        NA
2007-01-07       NA       NA      NA        NA         NA           NA          NA            NA 0.5407240
2007-01-08    98.50    99.50   98.35     98.90   10340000     84.17225           1             0        NA

Upvotes: 1

Views: 97

Answers (2)

ulfelder

Reputation: 5335

The problem arises because the output of predict() is not an xts object. The elements of the vector of predicted values have dates as names, but it is still a plain named vector with no time index. I was able to get a simple call to merge() to work, without dropping NAs before modeling, by first converting the output of predict() to class xts:

library(quantmod)
getSymbols("IBM")
IBM$CurrtDay_up <- ifelse(IBM$IBM.Open < IBM$IBM.Close, 1, 0)
IBM$LastDay_green <- ifelse((lag(IBM$IBM.Open, k=1) < lag(IBM$IBM.Close, k=1)), 1, 0)
glm_greenDay <- glm(CurrtDay_up~LastDay_green, data=IBM, family=binomial(link='logit'), na.action=na.exclude)
glm_pred <- predict(glm_greenDay, type='response')
glm_pred_xts <- xts(x = glm_pred, order.by = as.Date(names(glm_pred)))
IBM2 <- merge(IBM, glm_pred_xts)

That seems to produce the desired output:

> head(glm_pred)
2007-01-03 2007-01-04 2007-01-05 2007-01-08 2007-01-09 2007-01-10 
        NA  0.5383952  0.5383952  0.5383065  0.5383952  0.5383952 

> head(IBM2)
           IBM.Open IBM.High IBM.Low IBM.Close IBM.Volume IBM.Adjusted CurrtDay_up LastDay_green glm_pred_xts
2007-01-03    97.18    98.40   96.26     97.27    9196800     82.78498           1            NA           NA
2007-01-04    97.25    98.79   96.88     98.31   10524500     83.67011           1             1    0.5383952
2007-01-05    97.60    97.95   96.91     97.42    7221300     82.91264           0             1    0.5383952
2007-01-08    98.50    99.50   98.35     98.90   10340000     84.17225           1             0    0.5383065
2007-01-09    99.08   100.33   99.07    100.07   11108200     85.16802           1             1    0.5383952
2007-01-10    98.50    99.05   97.93     98.89    8744800     84.16374           1             1    0.5383952
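
A variation on the above, as a minimal sketch on synthetic data (the toy object and its column names are made up for illustration and are not real IBM prices; it assumes the xts package is installed). Because na.action=na.exclude pads the predictions back to the full number of rows, the index of the original object can probably be reused directly instead of parsing dates out of names(glm_pred):

```r
library(xts)

# Hypothetical stand-in for the IBM series: random open/close prices
set.seed(1)
n       <- 50
dates   <- seq(as.Date("2007-01-03"), by = "day", length.out = n)
open    <- runif(n, 95, 100)
close   <- runif(n, 95, 100)
up      <- as.numeric(open < close)     # 1 if the day closed above its open
prev_up <- c(NA, head(up, -1))          # previous day's value; first row is NA

toy <- xts(cbind(Open = open, Close = close, up = up, prev_up = prev_up),
           order.by = dates)

# Fit on a plain data frame; na.exclude keeps track of the dropped NA row
fit <- glm(up ~ prev_up, data = as.data.frame(toy),
           family = binomial(link = "logit"), na.action = na.exclude)

# predict() returns a plain numeric vector, padded with NA to nrow(toy)
pred <- predict(fit, type = "response")

# Wrap the predictions in an xts object that shares toy's index, then merge
pred_xts <- xts(matrix(pred, ncol = 1, dimnames = list(NULL, "glm_pred")),
                order.by = index(toy))
toy2 <- merge(toy, pred_xts)

head(toy2)
```

The point is the same as in the answer: merge() aligns on the time index, so the predictions need to be an xts object with matching timestamps before they are attached.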

Upvotes: 1

Draguru

Reputation: 11

It might be down to how you're constructing your final data frame and how R handles NAs.

The way I read your code, you're adding the result column to the data frame with:

IBM_1$glm_pred<-glm_greenDay$fitted.values

You might be able to put your result into a separate object and use cbind to attach it to the rest of your data frame without propagating the NAs across columns.

Maybe...

glm_pred<-matrix(glm_greenDay$fitted.values,ncol=1)
IBM_glm<-cbind(IBM_1,glm_pred)

Don't know if it's the most elegant but might be a start.

Upvotes: 1
