Reputation: 43
I am getting the error above when trying to use the cv.lm function. Please see my code:
sample<-read.csv("UU2_1_lung_cancer.csv",header=TRUE,sep=",",na.string="NA")
sample1<-sample[2:2000,3:131]
samplex<-sample[2:50,3:131]
y<-as.numeric(sample1[1,])
y<-as.numeric(sample1[2:50,2])
x1<-as.numeric(sample1[2:50,3])
x2<-as.numeric(sample1[2:50,4])
x11<-x1[!is.na(y)]
x12<-x2[!is.na(y)]
y<-y[!is.na(y)]
fit1 <- lm(y ~ x11 + x12, data=sample)
fit1
x3<-as.numeric(sample1[2:50,5])
x4<-as.numeric(sample1[2:50,6])
x13<-x3[!is.na(y)]
x14<-x4[!is.na(y)]
fit2 <- lm(y ~ x11 + x12 + x13 + x14, data=sample)
anova(fit1,fit2)
install.packages("DAAG")
library("DAAG")
cv.lm(df=samplex, fit1, m=10) # m=10 requests 10-fold cross-validation
Any insight will be appreciated.
Example of data
ID       peak height  LCA001  LCA002  LCA003
N001786  32391.111    0.397   0.229   -0.281
N005356  32341.473    0.397   -0.655  -1.301
N002416  32215.474    -0.703  -0.214  -0.901
GS239    31949.777    0.354   0.118   0.272
N016343  31698.853    0.226   0.04    -0.006
N003255  31604.978    0.024   NA      -0.534
N004358  31356.597    -0.252  -0.022  -0.407
N000122  31168.09     -0.487  -0.533  -0.134
GS10564  31106.103    -0.156  -0.141  -1.17
GS17987  31043.876    NA      0.253   0.553
N003674  30876.207    0.109   0.093   0.07
Please see the example of the data above
Upvotes: 0
Views: 13917
Reputation: 59345
First, you are using lm(...) incorrectly, or at least in a very unconventional way. The purpose of the data=sample argument is to let the formula refer to columns of sample by name. Your formula instead names free-standing vectors (y, x11, x12), which lm(...) finds in your workspace, so data=sample has no effect. Referencing free-standing vectors in a formula is generally bad practice.
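To make the point about data= concrete, here is a small sketch on made-up data. The data frame d and every variable name in it are hypothetical and not taken from the question. It shows that a formula built from free-standing vectors is fitted from the workspace, whatever is passed as data.

```r
# Toy data frame; the names d, y, x1, x2 are illustrative only.
set.seed(1)
d <- data.frame(y = rnorm(20), x1 = rnorm(20), x2 = rnorm(20))

# Free-standing vectors that are NOT columns of d.
yy  <- d$y + 100
xx1 <- d$x1
xx2 <- d$x2

# data = d has no effect here: yy, xx1 and xx2 are looked up in the workspace.
fit_free <- lm(yy ~ xx1 + xx2, data = d)

# Here every variable in the formula is a column of d.
fit_cols <- lm(y ~ x1 + x2, data = d)

# Which formula variables actually exist as columns of d?
all.vars(formula(fit_free)) %in% names(d)   # none of them
all.vars(formula(fit_cols)) %in% names(d)   # all of them
```

The first fit runs without complaint, which is why the mistake is easy to miss until another function tries to find those names inside the data frame.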
So try this:
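## not tested...
sample <- read.csv(...)
# rename columns 2:6 so the formula can refer to them by name
colnames(sample)[2:6] <- c("y","x1","x2","x3","x4")
fit1 <- lm(y~x1+x2, data=sample[2:50,],na.action=na.omit)
library(DAAG)
# drop incomplete rows only among the columns the model uses
cv.lm(df=na.omit(sample[2:50,c("y","x1","x2")]),fit1,m=10)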
This will give columns 2:6 the appropriate names and then use those in the formula. The argument na.action=na.omit tells the lm(...) function to exclude all rows where there is an NA value in any of the relevant columns. This is actually the default, so it is not needed in this case, but it is included for clarity.
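As a quick check of the NA handling described above, the following sketch uses a made-up data frame (all names are illustrative) and assumes the session's na.action option has not been changed from its usual value of na.omit.

```r
# Toy data with missing values; all names here are illustrative.
d <- data.frame(
  y  = c(1.0, 2.1, NA, 4.2, 5.1, 6.3),
  x1 = c(0.5, NA, 1.5, 2.0, 2.5, 3.0),
  x2 = c(1, 0, 1, 0, 1, 0)
)

fit_default <- lm(y ~ x1 + x2, data = d)                       # default na.action
fit_omit    <- lm(y ~ x1 + x2, data = d, na.action = na.omit)  # explicit

nobs(fit_default)   # number of rows actually used in the fit
nobs(fit_omit)
nrow(na.omit(d))    # number of complete rows in d

# Both fits should give the same coefficients.
all.equal(coef(fit_default), coef(fit_omit))
```

Both fits use only the complete rows, so spelling out na.action=na.omit documents the intent without changing the result.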
Finally, cv.lm(...) uses its second argument to find the formula definition, so in your code:
cv.lm(df=samplex, fit1, m=10)
is equivalent to:
cv.lm(df=samplex,y~x11+x12,m=10)
Since there are (presumably) no columns named x11 and x12 in samplex, and since you define these vectors externally, cv.lm(...) throws the error you are getting.
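One way to catch this mismatch before calling cv.lm(...) is to compare the variables in the fitted formula with the column names of the data frame. The sketch below uses base R only; the helper missing_formula_vars and the toy data frame samplex_toy are hypothetical names introduced for illustration, and cv.lm(...) itself is not called.

```r
# Hypothetical helper: which variables used by a fitted model are
# missing from the data frame you plan to hand to cross-validation?
missing_formula_vars <- function(fit, df) {
  setdiff(all.vars(formula(fit)), names(df))
}

set.seed(42)
samplex_toy <- data.frame(y = rnorm(30), x1 = rnorm(30), x2 = rnorm(30))

# Mimic the question: the predictors live outside the data frame.
x11 <- samplex_toy$x1
x12 <- samplex_toy$x2
y   <- samplex_toy$y
bad_fit  <- lm(y ~ x11 + x12)
good_fit <- lm(y ~ x1 + x2, data = samplex_toy)

missing_formula_vars(bad_fit, samplex_toy)    # "x11" "x12"
missing_formula_vars(good_fit, samplex_toy)   # character(0)
```

An empty result means every name in the formula can be found in the data frame, which is what a function that refits the model on subsets of that data frame needs.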
Upvotes: 1