user3424320

Reputation: 43

variable lengths differ in R

I am getting the error above when trying to use the cv.lm function. Please see my code:

sample<-read.csv("UU2_1_lung_cancer.csv",header=TRUE,sep=",",na.string="NA")
  sample1<-sample[2:2000,3:131]
  samplex<-sample[2:50,3:131]
  y<-as.numeric(sample1[1,]) 
  y<-as.numeric(sample1[2:50,2]) 
  x1<-as.numeric(sample1[2:50,3])
  x2<-as.numeric(sample1[2:50,4])
  x11<-x1[!is.na(y)]
  x12<-x2[!is.na(y)]
  y<-y[!is.na(y)]
  fit1 <- lm(y ~ x11 + x12, data=sample)
  fit1
  x3<-as.numeric(sample1[2:50,5])
  x4<-as.numeric(sample1[2:50,6])
  x13<-x3[!is.na(y)]
  x14<-x4[!is.na(y)]
  fit2 <- lm(y ~ x11 + x12 + x13 + x14, data=sample)
  anova(fit1,fit2)
  install.packages("DAAG")
  library("DAAG")
  cv.lm(df=samplex, fit1, m=10) # 10-fold cross-validation

Any insight will be appreciated.

Example of data
ID       peak height     LCA001 LCA002  LCA003
N001786 32391.111   0.397   0.229   -0.281
N005356 32341.473   0.397   -0.655  -1.301
N002416 32215.474   -0.703  -0.214  -0.901
GS239   31949.777   0.354   0.118   0.272
N016343 31698.853   0.226   0.04    -0.006
N003255 31604.978   0.024    NA -0.534
N004358 31356.597   -0.252  -0.022  -0.407
N000122 31168.09    -0.487  -0.533  -0.134
GS10564 31106.103   -0.156  -0.141  -1.17
GS17987 31043.876    NA     0.253   0.553
N003674 30876.207   0.109   0.093   0.07

Please see the example of the data above

Upvotes: 0

Views: 13917

Answers (1)

jlhoward

Reputation: 59345

First, you are using lm(...) incorrectly, or at least in a very unconventional way. The purpose of the data=sample argument is to let the formula refer to columns of sample by name. Using free-standing vectors in the formula instead is generally very bad practice.
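To illustrate why free-standing vectors are fragile, here is a minimal sketch on made-up data (the names d and x_free are hypothetical and are not from the question). lm() looks up each formula variable in data first and falls back to the formula's environment, so a data-frame column and an external vector of a different length can end up in the same model frame:

```r
# Hypothetical toy data; not the lung-cancer file from the question.
set.seed(1)
d <- data.frame(y = rnorm(20), x1 = rnorm(20))

# A free-standing vector that is shorter than the data frame.
x_free <- rnorm(15)

# y is found in d (20 values); x_free is found outside d (15 values).
msg <- tryCatch(
  {
    lm(y ~ x_free, data = d)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# Referring only to columns of d keeps every variable the same length.
fit_ok <- lm(y ~ x1, data = d)
print(nobs(fit_ok))
```

In an English locale the captured message should mention "variable lengths differ", which is the same error reported in the question.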

So try this:

## not tested...
sample <- read.csv(...)
colnames(sample)[2:6] <- c("y","x1","x2","x3","x4")
fit1 <- lm(y~x1+x2, data=sample[2:50,],na.action=na.omit)
library(DAAG)
cv.lm(df=na.omit(sample[2:50,]),fit1,m=10)

This will give columns 2:6 the appropriate names and then use those names in the formula. The argument na.action=na.omit tells the lm(...) function to exclude all rows where there is an NA value in any of the relevant columns. This is actually the default, so it is not needed in this case, but it is included for clarity.
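The NA handling can be checked on a small example. This is only a sketch with invented numbers, and it assumes that options("na.action") still has its factory setting of "na.omit":

```r
# Hypothetical toy data with one NA in y and one NA in x1.
d <- data.frame(
  y  = c(1.0, 2.1, 2.9, 4.2, 5.1, NA),
  x1 = c(1, 2, NA, 4, 5, 6)
)

# Fit once with the default NA handling and once with it spelled out.
fit_default  <- lm(y ~ x1, data = d)
fit_explicit <- lm(y ~ x1, data = d, na.action = na.omit)

# Rows without any NA, and rows actually used by the fit.
n_complete <- sum(complete.cases(d))
print(n_complete)
print(nobs(fit_default))

# Both fits should give the same coefficients.
same_coef <- isTRUE(all.equal(coef(fit_default), coef(fit_explicit)))
print(same_coef)
```

Both fits use only the complete rows, so passing na.action=na.omit explicitly changes nothing under that assumption.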

Finally, cv.lm(...) uses its second argument to find the formula definition, so in your code:

cv.lm(df=samplex, fit1, m=10)

is equivalent to:

cv.lm(df=samplex,y~x11+x12,m=10)

Since there are (presumably) no columns named x11 and x12 in samplex, and since you define these vectors externally, cv.lm(...) throws the error you are getting.
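One way to catch this mismatch before cross-validating is to compare the variables in the fitted model's formula with the column names of the data frame. The sketch below uses base R only and random stand-ins for the question's objects (the values and the columns of samplex are assumptions, not the real data):

```r
# Hypothetical stand-ins for the question's objects.
set.seed(2)
y   <- rnorm(10)
x11 <- rnorm(10)
x12 <- rnorm(10)
samplex <- data.frame(LCA001 = rnorm(10), LCA002 = rnorm(10))

# Model fitted on free-standing vectors, as in the question.
fit1 <- lm(y ~ x11 + x12)

# The formula stored in the fitted model.
print(formula(fit1))

# Variables the formula needs that are not columns of samplex.
missing_vars <- setdiff(all.vars(formula(fit1)), names(samplex))
print(missing_vars)
```

If missing_vars is not empty, any function that refits the formula on subsets of samplex has to find those variables somewhere else, and their lengths may no longer match the subset.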

Upvotes: 1
