Reputation: 23
For a linear regression that predicts house prices, I need to split the data into train and test samples in an 80%/20% proportion.
However, some of the variables are factors, and a few of their levels have just 1 observation.
Because of this, random sampling can put such a level in the test sample but not in the train sample.
Hence, when predicting the Sale Price on the test set, I get this error:
"Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor Exterior1st has new levels ImStucc"
Here is the summary of the train sample of Exterior1st variable:
> summary(train$Exterior1st)
AsbShng AsphShn BrkComm BrkFace  CBlock CemntBd HdBoard ImStucc MetalSd Plywood   Stone  Stucco
     11       0       1      36       0      41     173       0     164      78       2      17
VinylSd Wd Sdng WdShing
    389     140      17
Here is the summary of the test sample of Exterior1st variable:
> summary(test$Exterior1st)
AsbShng AsphShn BrkComm BrkFace  CBlock CemntBd HdBoard ImStucc MetalSd Plywood   Stone  Stucco
      4       0       0       8       1      11      37       1      37      22       0       4
VinylSd Wd Sdng WdShing
     97      43       3
As you can see, the ImStucc level of this variable is present in the test sample but not in the train sample, which is why the predict function throws the error mentioned above.
While searching for a solution, I came across a function called "stratified".
But that function does not seem to work in R.
There was another solution using dplyr group_by, but there we have to specify the number of observations for each group. That solution is not suitable for this dataset, as it would require a calculation for each factor.
Another solution I found samples a vector alone and not the data frame, so it does not help either:
> t <- sample(c(filtered_data$Exterior1st, sample(filtered_data$Exterior1st, size = 1000, replace = TRUE)))
> table(t)
t
  1   3   4   5   6   7   8   9  10  11  12  13  14  15
 26   2  74   2  91 375   1 345 168   3  37 848 329  36
The above sampling gives a total of 2337 entries, even though the size given is 1000. Hence, this is perhaps not what I'm looking for.
Is there a method to create a sample of 80% of the data such that at least 1 observation of each level of each factor is present within this sample?
If there isn't, what is the workaround for this situation?
Upvotes: 0
Views: 331
Reputation: 21
Maybe I am misreading, but if a level of a categorical variable has only 1 observation, you won't be able to use that level, ImStucc, in a regression.
I would remove that variable from the model, collect more data, or aggregate that level with other levels (if possible). (I would probably not include 2, 5, or 11 either, from the table t, because they also have few observations.)
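As a rough illustration of the aggregation option, here is a minimal base R sketch. The toy data frame, the helper name lump_rare_levels, the "Other" label and the min_n threshold are all assumptions made for the example and are not part of the original data or of any package.

```r
# Toy data standing in for the real data set (hypothetical values).
df <- data.frame(
  Exterior1st = factor(c(rep("VinylSd", 6), rep("HdBoard", 4), "ImStucc", "CBlock")),
  SalePrice   = c(200, 210, 190, 205, 215, 195, 150, 160, 155, 165, 120, 110)
)

# Hypothetical helper: merge every level with fewer than `min_n`
# observations into a single level named `other`.
lump_rare_levels <- function(x, min_n = 5, other = "Other") {
  x <- droplevels(as.factor(x))       # forget levels with zero observations
  counts <- table(x)                  # observations per level
  rare <- names(counts)[counts < min_n]
  # Assigning the same name to several levels merges them into one.
  levels(x)[levels(x) %in% rare] <- other
  x
}

df$Exterior1st <- lump_rare_levels(df$Exterior1st, min_n = 3)
table(df$Exterior1st)
```

With the rare levels merged before the split, a random 80/20 split is much less likely to leave a level out of the train sample, although a suitable threshold depends on the data.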
Also, the function sample (with replace = TRUE) can choose the same observation multiple times. Set replace = FALSE to avoid duplicate entries.
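Note also that in the question's call, c() appends the 1000 resampled values to the original vector, so the result has the original length plus 1000. A minimal sketch of an alternative, assuming it is acceptable for the train share to end up slightly above 80%: sample row indices of the data frame without replacement, then move any test row whose level never occurs in the train rows over to the train set. The toy data and the seed are made up for the example.

```r
set.seed(42)  # arbitrary seed, only to make the example reproducible

# Toy data standing in for the real data set (hypothetical values).
df <- data.frame(
  Exterior1st = factor(c(rep("VinylSd", 12), rep("HdBoard", 6), "ImStucc", "CBlock")),
  SalePrice   = seq(100, by = 5, length.out = 20)
)

# Sample 80% of the row indices without replacement.
n <- nrow(df)
train_idx <- sample(seq_len(n), size = floor(0.8 * n), replace = FALSE)
test_idx  <- setdiff(seq_len(n), train_idx)

# Test rows whose level does not occur in any training row.
unseen <- test_idx[!(df$Exterior1st[test_idx] %in% df$Exterior1st[train_idx])]

# Move those rows into the training set so predict() meets no new levels.
train_idx <- c(train_idx, unseen)
test_idx  <- setdiff(test_idx, unseen)

train <- df[train_idx, ]
test  <- df[test_idx, ]

fit  <- lm(SalePrice ~ Exterior1st, data = train)
pred <- predict(fit, newdata = test)
```

For several factor columns, the same check would have to be repeated for each of them. A level with a single observation is then always in the train set and never in the test set, so its coefficient cannot be validated on held-out data.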
Upvotes: 1