user8810618

Reputation: 115

contrasts can only be applied to factors with at least two levels

I want to predict sales using linear regression. This is my data table used for modeling.

> store
     Store Sales CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2 Promo2SinceWeek Promo2SinceYear Assortment_a
  1:     3  8314               14130                        12                     2006      1              14            2011            1
  2:     3  8977               14130                        12                     2006      1              14            2011            1
  3:     3  7610               14130                        12                     2006      1              14            2011            1
  4:     3  8864               14130                        12                     2006      1              14            2011            1
  5:     3  8107               14130                        12                     2006      1              14            2011            1
 ---                                                                                                                                       
775:     3 12247               14130                        12                     2006      1              14            2011            1
776:     3  4523               14130                        12                     2006      1              14            2011            1
777:     3  6069               14130                        12                     2006      1              14            2011            1
778:     3  5902               14130                        12                     2006      1              14            2011            1
779:     3  6823               14130                        12                     2006      1              14            2011            1
     Assortment_b Assortment_c StoreType_a StoreType_b StoreType_c StoreType_d DayOfWeek Open Promo SchoolHoliday DateYear DateMonth
  1:            0            0           1           0           0           0         5    1     1             1     2015         7
  2:            0            0           1           0           0           0         4    1     1             1     2015         7
  3:            0            0           1           0           0           0         3    1     1             1     2015         7
  4:            0            0           1           0           0           0         2    1     1             1     2015         7
  5:            0            0           1           0           0           0         1    1     1             1     2015         7
 ---                                                                                                                                
775:            0            0           1           0           0           0         1    1     1             0     2013         1
776:            0            0           1           0           0           0         6    1     0             0     2013         1
777:            0            0           1           0           0           0         5    1     0             1     2013         1
778:            0            0           1           0           0           0         4    1     0             1     2013         1
779:            0            0           1           0           0           0         3    1     0             1     2013         1
     DateDay DateWeek StateHoliday_0 StateHoliday_a StateHoliday_b StateHoliday_c CompetitionOpen PromoOpen IspromoinSales Prediction
  1:      31       30              1              0              0              0             103     52.00              1          0
  2:      30       30              1              0              0              0             103     52.00              1          0
  3:      29       30              1              0              0              0             103     52.00              1          0
  4:      28       30              1              0              0              0             103     52.00              1          0
  5:      27       30              1              0              0              0             103     52.00              1          0
 ---                                                                                                                                 
775:       7        1              1              0              0              0              73     20.75              1          0
776:       5        0              1              0              0              0              73     20.50              1          0
777:       4        0              1              0              0              0              73     20.50              1          0
778:       3        0              1              0              0              0              73     20.50              1          0
779:       2        0              1              0              0              0              73     20.50              1          0
> 

When I fit the model, I get this error:

contrasts can only be applied to factors with at least two levels

I applied what @Scott suggested here, since I don't have any NA values.

I need to know which columns should be converted to factor variables in the model.

  > lapply(store, function(x) ifelse(is.factor(x) | is.integer(x), levels(factor(x)), "numeric"))
$Store
[1] "3"

$Sales
[1] "numeric"

$CompetitionDistance
[1] "14130"

$CompetitionOpenSinceMonth
[1] "12"

$CompetitionOpenSinceYear
[1] "2006"

$Promo2
[1] "1"

$Promo2SinceWeek
[1] "14"

$Promo2SinceYear
[1] "2011"

$Assortment_a
[1] "1"

$Assortment_b
[1] "0"

$Assortment_c
[1] "0"

$StoreType_a
[1] "1"

$StoreType_b
[1] "0"

$StoreType_c
[1] "0"

$StoreType_d
[1] "0"

$DayOfWeek
[1] "1"

$Open
[1] "1"

$Promo
[1] "0"

$SchoolHoliday
[1] "0"

$DateYear
[1] "numeric"

$DateMonth
[1] "numeric"

$DateDay
[1] "numeric"

$DateWeek
[1] "numeric"

$StateHoliday_0
[1] "1"

$StateHoliday_a
[1] "0"

$StateHoliday_b
[1] "0"

$StateHoliday_c
[1] "0"

$CompetitionOpen
[1] "numeric"

$PromoOpen
[1] "numeric"

$IspromoinSales
[1] "numeric"

$Prediction
[1] "numeric"
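
A note on reading the output above: `ifelse()` returns a result of the same length as its test, and the test here is a single TRUE/FALSE per column, so at most the first level of each column is printed. A minimal sketch of an alternative check that counts distinct values per column is given below; `toy` is a hypothetical stand-in for `store`, not the real data.

```r
# Hypothetical stand-in for `store`: one constant column, two varying ones
toy <- data.frame(
  Store     = c(3L, 3L, 3L, 3L),
  DayOfWeek = c(5L, 4L, 3L, 2L),
  Sales     = c(8314, 8977, 7610, 8864)
)

# Number of distinct values in each column
n_distinct <- sapply(toy, function(x) length(unique(x)))
print(n_distinct)

# Columns with a single distinct value would become one-level factors
constant_cols <- names(n_distinct)[n_distinct < 2]
print(constant_cols)
```

The same `sapply()` call should also work on a data.table, since it iterates over columns as a list.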

My model is shown below. Please look at how I wrote the lm call.

M<-matrix(0,nrow=10,ncol = 1)
store <- data[Store == 3,]  # select a single store by its unique ID
shuffledIndices <- sample(nrow(store))  # shuffle the row order
setDT(store)[,Prediction:=0]
z <- nrow(store)
for (i in 1:10) 
{    # 10-fold cross-validation
  sampleIndex <- floor(1+0.1*(i-1)*z):(0.1*i*z)  # 10% of the rows are selected
  test <- store[shuffledIndices[sampleIndex],]  # used as the test set
  train <- store[shuffledIndices[-sampleIndex],]  # used as the training set
  modell <- lm(Sales ~ as.factor(CompetitionDistance) + as.factor(CompetitionOpenSinceMonth) + as.factor(CompetitionOpenSinceYear) + 
                 as.factor(Promo2)+as.factor(Promo2SinceWeek)+as.factor(Promo2SinceYear)+as.factor(Assortment_a)+as.factor(Assortment_b)+as.factor(Assortment_c)+
                 as.factor(StoreType_a)+as.factor(StoreType_b)+as.factor(StoreType_c)+as.factor(StoreType_d)+as.factor(DayOfWeek)+as.factor(Open)+SchoolHoliday+
                 as.factor(Promo)+as.factor(StateHoliday_0)+as.factor(StateHoliday_a)+as.factor(StateHoliday_b)+as.factor(StateHoliday_c)+
                 as.factor(DateYear)+as.factor(DateMonth)+as.factor(DateDay)+as.factor(DateWeek)+as.factor(CompetitionOpen)+as.factor(PromoOpen)+as.factor(IspromoinSales),train)  # a linear model is fitted to the training set
  store[shuffledIndices[sampleIndex],Prediction:=predict(modell,test)] # predictions are generated for the test set based on the model
  M[i,1]<-(round(sqrt(mean((store[shuffledIndices[sampleIndex],Prediction]-test$Sales)^2))/mean(test$Sales),4))  # relative RMSE, using only this fold's predictions
}

plot(1:10,M[,1],type='b',xlab="i",ylab="rmse%")

But I always get the error, which seems really weird. How do you explain this? Thank you in advance.

Upvotes: 0

Views: 1367

Answers (1)

kath

Reputation: 7724

The problem is that you have constant variables in your model. Wrapping a constant column in as.factor() yields a factor with a single level, and lm() cannot build contrasts for such a factor, which is what the error message says. These variables don't add information and thus should be excluded from the modelling process.
Why? You want to model Sales given all your other variables. A variable that never changes cannot explain how Sales changes.
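
To illustrate the point, here is a minimal sketch that reproduces the error on a small made-up data frame. The name `toy` and its values are hypothetical; only the column names mirror the question.

```r
# Hypothetical toy data: `Store` is constant, `DayOfWeek` varies
toy <- data.frame(
  Sales     = c(8314, 8977, 7610, 8864, 8107, 6823),
  Store     = rep(3L, 6),
  DayOfWeek = c(5L, 4L, 3L, 2L, 1L, 3L)
)

# A one-level factor in the formula makes lm() fail; capture the message
bad <- tryCatch(
  lm(Sales ~ as.factor(Store) + as.factor(DayOfWeek), data = toy),
  error = function(e) conditionMessage(e)
)
print(bad)

# Dropping the constant term lets the fit go through
ok <- lm(Sales ~ as.factor(DayOfWeek), data = toy)
print(class(ok))
```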

If you modify your model in the following way, your code should work:

modell <- lm(Sales ~ as.factor(DayOfWeek) + SchoolHoliday + as.factor(Promo) + 
               as.factor(DateYear) + as.factor(DateMonth) + as.factor(DateDay) + 
               as.factor(DateWeek) + as.factor(CompetitionOpen) + as.factor(PromoOpen), 
             data = train)
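
Rather than listing the varying columns by hand, the formula could also be built programmatically. The following is a sketch under assumptions: `train_toy` is a hypothetical stand-in for `train`, and the predictors are left numeric for brevity.

```r
# Hypothetical toy training set standing in for `train`
train_toy <- data.frame(
  Sales         = c(8314, 8977, 7610, 8864, 8107, 6823, 5902, 6069),
  Store         = rep(3L, 8),
  Promo2        = rep(1L, 8),
  DayOfWeek     = c(5L, 4L, 3L, 2L, 1L, 3L, 4L, 5L),
  SchoolHoliday = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L)
)

# Count distinct values per column and keep predictors with at least two
n_distinct <- sapply(train_toy, function(x) length(unique(x)))
varying <- setdiff(names(n_distinct)[n_distinct > 1], "Sales")

# Build the formula from the surviving predictors and fit
f <- reformulate(varying, response = "Sales")
print(f)

fit <- lm(f, data = train_toy)
print(names(coef(fit)))
```

In a cross-validation loop it is probably safer to run this check on each training fold, since a column that varies in the full data can still be constant within one fold.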

One additional remark:
You are transforming all your variables into factors. PromoOpen, for example, seems to be a numeric variable, so it might be better to keep it as numeric. This of course depends on your data and the desired interpretation of your model.
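
A small sketch of what the choice means in practice, using made-up values (the `toy` data frame is hypothetical): as a factor, PromoOpen gets one coefficient per distinct value, while as a numeric variable it gets a single slope.

```r
# Hypothetical toy data with eight distinct PromoOpen values
toy <- data.frame(
  Sales     = c(8314, 8977, 7610, 8864, 8107, 6823, 5902, 6069),
  PromoOpen = c(52, 51.75, 51.5, 51.25, 21, 20.75, 20.5, 20.25)
)

# As a factor: an intercept plus one coefficient per non-baseline level
fit_factor <- lm(Sales ~ as.factor(PromoOpen), data = toy)

# As numeric: an intercept plus a single slope
fit_numeric <- lm(Sales ~ PromoOpen, data = toy)

print(length(coef(fit_factor)))
print(length(coef(fit_numeric)))
```

With the factor coding, predict() also fails if the test set contains a level that was not present in the training set, which can easily happen when many-valued columns are split across folds.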

Upvotes: 2
