Reputation: 115
I want to predict sales using linear regression. This is my data table used for modeling.
> store
Store Sales CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2 Promo2SinceWeek Promo2SinceYear Assortment_a
1: 3 8314 14130 12 2006 1 14 2011 1
2: 3 8977 14130 12 2006 1 14 2011 1
3: 3 7610 14130 12 2006 1 14 2011 1
4: 3 8864 14130 12 2006 1 14 2011 1
5: 3 8107 14130 12 2006 1 14 2011 1
---
775: 3 12247 14130 12 2006 1 14 2011 1
776: 3 4523 14130 12 2006 1 14 2011 1
777: 3 6069 14130 12 2006 1 14 2011 1
778: 3 5902 14130 12 2006 1 14 2011 1
779: 3 6823 14130 12 2006 1 14 2011 1
Assortment_b Assortment_c StoreType_a StoreType_b StoreType_c StoreType_d DayOfWeek Open Promo SchoolHoliday DateYear DateMonth
1: 0 0 1 0 0 0 5 1 1 1 2015 7
2: 0 0 1 0 0 0 4 1 1 1 2015 7
3: 0 0 1 0 0 0 3 1 1 1 2015 7
4: 0 0 1 0 0 0 2 1 1 1 2015 7
5: 0 0 1 0 0 0 1 1 1 1 2015 7
---
775: 0 0 1 0 0 0 1 1 1 0 2013 1
776: 0 0 1 0 0 0 6 1 0 0 2013 1
777: 0 0 1 0 0 0 5 1 0 1 2013 1
778: 0 0 1 0 0 0 4 1 0 1 2013 1
779: 0 0 1 0 0 0 3 1 0 1 2013 1
DateDay DateWeek StateHoliday_0 StateHoliday_a StateHoliday_b StateHoliday_c CompetitionOpen PromoOpen IspromoinSales Prediction
1: 31 30 1 0 0 0 103 52.00 1 0
2: 30 30 1 0 0 0 103 52.00 1 0
3: 29 30 1 0 0 0 103 52.00 1 0
4: 28 30 1 0 0 0 103 52.00 1 0
5: 27 30 1 0 0 0 103 52.00 1 0
---
775: 7 1 1 0 0 0 73 20.75 1 0
776: 5 0 1 0 0 0 73 20.50 1 0
777: 4 0 1 0 0 0 73 20.50 1 0
778: 3 0 1 0 0 0 73 20.50 1 0
779: 2 0 1 0 0 0 73 20.50 1 0
>
I get the following error:
contrasts can only be applied to factors with at least two levels
I applied what @Scott suggested here, because I don't have any NA values.
I need to know which columns should be converted to factor variables in the model.
> lapply(store, function(x) ifelse(is.factor(x) | is.integer(x), levels(factor(x)), "numeric"))
$Store
[1] "3"
$Sales
[1] "numeric"
$CompetitionDistance
[1] "14130"
$CompetitionOpenSinceMonth
[1] "12"
$CompetitionOpenSinceYear
[1] "2006"
$Promo2
[1] "1"
$Promo2SinceWeek
[1] "14"
$Promo2SinceYear
[1] "2011"
$Assortment_a
[1] "1"
$Assortment_b
[1] "0"
$Assortment_c
[1] "0"
$StoreType_a
[1] "1"
$StoreType_b
[1] "0"
$StoreType_c
[1] "0"
$StoreType_d
[1] "0"
$DayOfWeek
[1] "1"
$Open
[1] "1"
$Promo
[1] "0"
$SchoolHoliday
[1] "0"
$DateYear
[1] "numeric"
$DateMonth
[1] "numeric"
$DateDay
[1] "numeric"
$DateWeek
[1] "numeric"
$StateHoliday_0
[1] "1"
$StateHoliday_a
[1] "0"
$StateHoliday_b
[1] "0"
$StateHoliday_c
[1] "0"
$CompetitionOpen
[1] "numeric"
$PromoOpen
[1] "numeric"
$IspromoinSales
[1] "numeric"
$Prediction
[1] "numeric"
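Note that `ifelse()` returns a result of the same length as its test. Here the test is a single TRUE/FALSE per column, so the output above shows only the first level of each column, not all of them. A minimal sketch of counting distinct values per column instead, using a small made-up data frame `toy` as a stand-in for `store` (the values are illustrative, not the real data):

```r
# Toy stand-in for `store` (illustrative values only)
toy <- data.frame(
  Store     = rep(3L, 6),
  Sales     = c(8314, 8977, 7610, 8864, 8107, 6823),
  DayOfWeek = c(5L, 4L, 3L, 2L, 1L, 3L),
  Open      = rep(1L, 6)
)

# Number of distinct values in each column
n_distinct <- sapply(toy, function(x) length(unique(x)))
print(n_distinct)

# Columns with a single distinct value are constant
constant_cols <- names(n_distinct)[n_distinct < 2]
print(constant_cols)
```

Running the same `sapply()` call on `store` would show which columns have only one value for this store.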
My model is shown below. Please look at how I wrote the lm call.
library(data.table)

M <- matrix(0, nrow = 10, ncol = 1)
store <- data[Store == 3, ]             # select one store by its unique number
shuffledIndices <- sample(nrow(store))  # shuffle the row order
setDT(store)[, Prediction := 0]
z <- nrow(store)
for (i in 1:10) {  # 10-fold cross-validation
  sampleIndex <- floor(1 + 0.1 * (i - 1) * z):(0.1 * i * z)  # 10% of the rows are selected
  test  <- store[shuffledIndices[sampleIndex], ]   # used as the test set
  train <- store[shuffledIndices[-sampleIndex], ]  # used as the training set
  # a linear model is fitted to the training set
  modell <- lm(Sales ~ as.factor(CompetitionDistance) + as.factor(CompetitionOpenSinceMonth) + as.factor(CompetitionOpenSinceYear) +
                 as.factor(Promo2) + as.factor(Promo2SinceWeek) + as.factor(Promo2SinceYear) + as.factor(Assortment_a) + as.factor(Assortment_b) + as.factor(Assortment_c) +
                 as.factor(StoreType_a) + as.factor(StoreType_b) + as.factor(StoreType_c) + as.factor(StoreType_d) + as.factor(DayOfWeek) + as.factor(Open) + SchoolHoliday +
                 as.factor(Promo) + as.factor(StateHoliday_0) + as.factor(StateHoliday_a) + as.factor(StateHoliday_b) + as.factor(StateHoliday_c) +
                 as.factor(DateYear) + as.factor(DateMonth) + as.factor(DateDay) + as.factor(DateWeek) + as.factor(CompetitionOpen) + as.factor(PromoOpen) + as.factor(IspromoinSales),
               data = train)
  # predictions are generated for the test set based on the model
  store[shuffledIndices[sampleIndex], Prediction := predict(modell, test)]
  # relative RMSE, comparing only the test-fold predictions with the test-fold sales
  M[i, 1] <- round(sqrt(mean((store$Prediction[shuffledIndices[sampleIndex]] - test$Sales)^2)) / mean(test$Sales), 4)
}
plot(1:10, M[, 1], type = 'b', xlab = "i", ylab = "rmse%")
But I get the error every time, which seems really strange. How can this be explained? Thank you in advance.
Upvotes: 0
Views: 1367
Reputation: 7724
The problem is that you have constant variables in your model. Wrapping a constant column in as.factor() produces a factor with a single level, which is what the contrasts error complains about. These variables don't add information and thus should be excluded from the modelling process.
Why? You want to model Sales given all your other variables. A variable that never changes cannot provide any information about how Sales changes.
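The effect of a constant column can be reproduced on a tiny example. This is a minimal sketch with made-up numbers; the data frame `toy` and its values are assumptions for illustration only, and the exact wording of the error message may differ between R versions and locales:

```r
# Toy data: `Open` is constant, `Promo` varies (illustrative values only)
toy <- data.frame(
  Sales = c(8314, 8977, 7610, 8864, 8107, 6823),
  Promo = c(1, 1, 0, 0, 1, 0),
  Open  = rep(1, 6)
)

# A one-level factor in the formula makes lm() fail;
# tryCatch() captures the error message instead of stopping
msg <- tryCatch(
  {
    lm(Sales ~ as.factor(Promo) + as.factor(Open), data = toy)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# Without the constant column the fit goes through
fit <- lm(Sales ~ as.factor(Promo), data = toy)
print(coef(fit))
```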
If you modify your model in the following way, your code should work:
modell <- lm(Sales ~ as.factor(DayOfWeek) + SchoolHoliday + as.factor(Promo) +
as.factor(DateYear) + as.factor(DateMonth) + as.factor(DateDay) +
as.factor(DateWeek) + as.factor(CompetitionOpen) + as.factor(PromoOpen),
data = train)
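Instead of listing the varying columns by hand, the formula could also be built from whichever predictors actually vary in the training set. The following is only a sketch under assumptions: `train_toy` is a made-up data frame that borrows a few column names from the question, and the predictors are left numeric for brevity.

```r
# Toy training set (illustrative values; column names borrowed from the question)
train_toy <- data.frame(
  Sales       = c(8314, 8977, 7610, 8864, 8107, 6823, 5902, 6069),
  Store       = rep(3, 8),
  DayOfWeek   = c(5, 4, 3, 2, 1, 3, 4, 5),
  Promo       = c(1, 1, 1, 0, 0, 0, 1, 0),
  StoreType_a = rep(1, 8)
)

# Keep only predictors with more than one distinct value
predictors <- setdiff(names(train_toy), "Sales")
varying <- predictors[sapply(train_toy[predictors],
                             function(x) length(unique(x)) > 1)]
print(varying)

# reformulate() builds the formula Sales ~ DayOfWeek + Promo
f <- reformulate(varying, response = "Sales")
fit <- lm(f, data = train_toy)
print(coef(fit))
```

Inside a cross-validation loop the check would need to run on each training fold, since a column can vary in the full data yet be constant within one fold.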
One additional remark:
You are transforming all your variables into factors. PromoOpen, for example, seems to be a numeric variable, so it might be better to keep it numeric. This of course depends on your data and the desired interpretation of your model.
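To illustrate the difference, here is a small sketch with made-up values (the data frame `toy` and the new value 30 are assumptions, not taken from the real data). As a factor, PromoOpen gets one coefficient per distinct value beyond the first, and `predict()` fails for a value that was not seen during fitting, which may matter when the test fold contains values absent from the training fold:

```r
# Toy data with four distinct PromoOpen values (illustrative only)
toy <- data.frame(
  Sales     = c(8314, 8977, 7610, 8864, 8107, 6823, 5902, 6069),
  PromoOpen = c(52, 52, 51.75, 51.75, 20.75, 20.5, 20.5, 20.5)
)

fit_numeric <- lm(Sales ~ PromoOpen, data = toy)             # intercept + one slope
fit_factor  <- lm(Sales ~ as.factor(PromoOpen), data = toy)  # intercept + one dummy per extra level

n_numeric <- length(coef(fit_numeric))
n_factor  <- length(coef(fit_factor))
print(c(numeric = n_numeric, factor = n_factor))

# A PromoOpen value that does not occur in the fitting data
new_row <- data.frame(PromoOpen = 30)
pred_numeric <- predict(fit_numeric, new_row)  # works: the slope extrapolates

# The factor version has no level for 30, so predict() raises an error
factor_failed <- inherits(
  tryCatch(predict(fit_factor, new_row), error = function(e) e),
  "error"
)
print(factor_failed)
```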
Upvotes: 2