Kenrich

Reputation: 23

R: Create sample with at least one element from each category

To predict house prices with linear regression, I need to split the data into train and test samples in an 80%/20% proportion.
However, some of the variables are factors, and a few of their levels have just 1 observation.
Because of this, random sampling can put those levels in the test sample and not in the train sample.
Then, when predicting the Sale Price on the test set, this error occurs:

    "Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
    factor Exterior1st has new levels ImStucc"


     

Here is the summary of the train sample of Exterior1st variable:

    > summary(train$Exterior1st)
    AsbShng AsphShn BrkComm BrkFace  CBlock CemntBd HdBoard ImStucc MetalSd Plywood   Stone  Stucco
         11       0       1      36       0      41     173       0     164      78       2      17
    VinylSd Wd Sdng WdShing
        389     140      17

Here is the summary of the test sample of Exterior1st variable:

    > summary(test$Exterior1st)
    AsbShng AsphShn BrkComm BrkFace  CBlock CemntBd HdBoard ImStucc MetalSd Plywood   Stone  Stucco
          4       0       0       8       1      11      37       1      37      22       0       4
    VinylSd Wd Sdng WdShing
         97      43       3

As you can see, the ImStucc level of this variable is present in the test sample but not in the train sample, which is why the predict function throws the error mentioned above.

While searching for a solution, I came across a function called "stratified", but that function does not seem to work in R.
There was another solution using dplyr group_by, but it requires specifying the number of observations for each group. That is not suitable for this dataset, as it would require a calculation for each factor.
Another solution sampled a vector alone rather than the data frame, so it does not help:

    > t = sample(c(filtered_data$Exterior1st, sample(filtered_data$Exterior1st, size = 1000, replace = TRUE)))
    > table(t)
    t
      1   3   4   5   6   7   8   9  10  11  12  13  14  15
     26   2  74   2  91 375   1 345 168   3  37 848 329  36

The above sampling gives a total of 2337 entries, even though the size given is 1000. Hence, this is perhaps not what I'm looking for.

Is there a method to create a sample of 80% of the data such that at least 1 observation from each level of every factor is present within this sample?

If there isn't, what is the workaround for this situation?

Upvotes: 0

Views: 331

Answers (1)

Luke McDonald

Reputation: 21

Maybe I am misreading, but if you only have 1 observation of a level of a categorical variable, you won't be able to use that level, ImStucc, in a regression.

I would remove that variable from the model, collect more data, or aggregate it with other levels (if possible). (I would probably not include 2, 5, or 11 either, going by the table t, because they also have few observations.)
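
As a rough sketch of the aggregation option, assuming base R only: the helper `collapse_rare`, the threshold `min_n`, the label "Other", and the toy data frame `df` are all made up for illustration and are not part of the original data.

```r
# Toy stand-in for the real data; values are invented for illustration.
df <- data.frame(
  Exterior1st = factor(c(rep("VinylSd", 6), rep("HdBoard", 5), "ImStucc", "CBlock")),
  SalePrice   = c(200, 210, 190, 205, 220, 215, 150, 160, 155, 165, 158, 300, 120)
)

# Hypothetical helper: merge every level with fewer than min_n rows
# into a single catch-all level.
collapse_rare <- function(f, min_n, other = "Other") {
  counts <- table(f)
  rare <- names(counts)[counts < min_n]
  f <- as.character(f)
  f[f %in% rare] <- other
  factor(f)
}

df$Exterior1st <- collapse_rare(df$Exterior1st, min_n = 5)
table(df$Exterior1st)
```

The threshold is a judgment call. Collapsing before the train/test split means both samples share the same, smaller set of levels.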


Also, the function sample (when replace = TRUE) will choose the same observation multiple times. Set it to replace = FALSE to avoid duplication of entries.
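
If you still want every level to appear in the training sample, one possible approach is to sample row indices of the data frame rather than the factor values themselves. The sketch below is base R only; the function `split_keep_levels` and the toy data are hypothetical, it handles a single factor column, and it assumes the training size is at least the number of levels.

```r
set.seed(42)  # only so the toy example is reproducible

# Toy data; names and values are illustrative, not the real housing data.
df <- data.frame(
  Exterior1st = factor(c(rep("VinylSd", 50), rep("HdBoard", 30),
                         rep("MetalSd", 18), "ImStucc", "CBlock")),
  SalePrice   = round(runif(100, 100, 300))
)

# Hypothetical helper: 80/20 split that forces one row per observed level
# into the training sample, then fills the rest without replacement.
split_keep_levels <- function(data, col, train_frac = 0.8) {
  n <- nrow(data)
  n_train <- round(train_frac * n)
  rows_by_level <- split(seq_len(n), factor(data[[col]]))
  # Index via sample.int so a single-row level is not misread by sample().
  forced <- vapply(rows_by_level,
                   function(idx) idx[sample.int(length(idx), 1)],
                   integer(1))
  rest <- setdiff(seq_len(n), forced)
  extra <- rest[sample.int(length(rest), n_train - length(forced))]
  train_idx <- c(forced, extra)
  list(train = data[train_idx, , drop = FALSE],
       test  = data[-train_idx, , drop = FALSE])
}

parts <- split_keep_levels(df, "Exterior1st")
table(parts$train$Exterior1st)
```

This only stops predict from failing on an unseen level. A level with a single training row still gets a coefficient fitted from one observation, so the caveat above applies. As for the 2337 entries in the question, they appear to come from c() joining the 1337 original values (1069 train + 268 test in the summaries) with the 1000 resampled ones.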

Upvotes: 1
