Reputation: 187
I split my entire dataset into two parts: one for training and the other for testing.
The training dataset contains 70 observations and the test dataset contains 14 observations. My model has 1 numeric dependent variable and 5 numeric independent variables.
I fit a multiple regression on the training dataset, and each time I ran the code, the adjusted R2 on the training dataset changed instead of staying constant. Its values varied between 60% and 70%.
The function I used to split the data calls both "sample" and "set.seed".
My question is: in this case, how should I interpret the non-constant adjusted R2 values from the training dataset? Is this normal?
splitdf <- function(dataframe, seed=NULL) {
  # Fix the RNG state only when a seed is supplied
  if (!is.null(seed)) set.seed(seed)
  index <- seq_len(nrow(dataframe))
  # Sample 1/6 of the rows for the test set; the rest form the training set
  testindex <- sample(index, trunc(length(index)/6))
  testset <- dataframe[testindex, ]
  trainset <- dataframe[-testindex, ]
  list(trainset=trainset, testset=testset)
}
splits <- splitdf(df, seed=1234)
str(splits)
my_train <- splits$trainset
my_test <- splits$testset
PS: the model satisfies all of the linear regression assumptions.
Upvotes: 0
Views: 219
Reputation: 4220
If you use the same seed, the split is identical every time, so your R2 should not change.
#sim data
set.seed(12)
data <- data.frame(Y=rnorm(10),X1=rnorm(10),X2=rnorm(10),X3=rnorm(10))
#split data
splits <- splitdf(data, seed=1234)
my_train <- splits$trainset
my_test <- splits$testset
summary(lm(Y~X1+X2+X3,my_train))$r.squared
#[1] 0.3922881
#split again using same seed...get same results
splits <- splitdf(data, seed=1234)
my_train <- splits$trainset
my_test <- splits$testset
summary(lm(Y~X1+X2+X3,my_train))$r.squared
#[1] 0.3922881
#split using different seed...get different results
splits <- splitdf(data, seed=5555)
my_train <- splits$trainset
my_test <- splits$testset
summary(lm(Y~X1+X2+X3,my_train))$r.squared
#[1] 0.7948203
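To see how much of the variation comes from the split alone, one option is to repeat the split over many seeds and look at the spread of the adjusted R2. The following is only a sketch under stated assumptions: the data are simulated (84 rows and 5 numeric predictors, matching the sizes in the question, with made-up coefficients), `adj_r2` is a hypothetical helper, and `splitdf` is re-declared so the block runs on its own.

```r
# Re-declared from the question so this block is self-contained:
# 1/6 of the rows go to the test set, the rest to the training set.
splitdf <- function(dataframe, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  index <- seq_len(nrow(dataframe))
  testindex <- sample(index, trunc(length(index) / 6))
  list(trainset = dataframe[-testindex, ], testset = dataframe[testindex, ])
}

# Simulated data (an assumption, not the asker's data): 84 rows, 5 predictors.
set.seed(12)
n <- 84
sim <- data.frame(X1 = rnorm(n), X2 = rnorm(n), X3 = rnorm(n),
                  X4 = rnorm(n), X5 = rnorm(n))
sim$Y <- 1 + sim$X1 + 0.5 * sim$X2 + rnorm(n)

# Hypothetical helper: adjusted R2 of the training fit for a given seed.
adj_r2 <- function(seed) {
  train <- splitdf(sim, seed = seed)$trainset
  summary(lm(Y ~ X1 + X2 + X3 + X4 + X5, data = train))$adj.r.squared
}

# Same seed twice: the same rows are selected, so the value is identical.
same_seed_matches <- identical(adj_r2(1234), adj_r2(1234))
print(same_seed_matches)

# Fifty different seeds: each split selects different rows,
# so the adjusted R2 moves around.
values <- sapply(1:50, adj_r2)
print(range(values))
```

If the value changes between runs even though a seed is passed, it is worth checking that the seed argument is actually reaching `splitdf` on every call; with `seed=NULL` the function never calls `set.seed`, and each run draws a new split.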
Upvotes: 0