Reputation: 323
I am attempting to produce a classification model based on the work of qualitative survey data. About 10K of our customers were researched and as a result a segmentation model was built and subsequently each customer categorised into 1 of 8 customer segments. The challenge is to now classify the TOTAL customer base into those segments. As only certain customers responded the researcher used overall demographics to apply post-stratification weights (or frequency weights).
My task is to now use our customer data as explanatory variables on this 10K in order to build a classification model for the whole base.
In order to handle the customer weights I simply duplicated each customer record by each respective frequency weight and the data set exploded to about 72K. I then split this data into train and test and used the R caret package to train a GBM and using the final chosen model classified my hold-out test set.
I was getting 82% accuracy and thought the results were too good to be true. After thinking about it I think the issue is that the model is inadvertently seeing records in train that are exactly the same in test (some records might be exactly duplicated up to 10 times).
I know that the GLM model function allows you to use the weight parameter to refer to a vector of weights but my question is how to utilise other machine learning algorithms, such as GBM or Random Forests, in R?
Thanks
Upvotes: 0
Views: 374