Reputation: 560
I have been working on a machine learning model and I am confused about which model to choose, or whether there is another technique I should try. I am using a Random Forest to predict the propensity to convert with a highly imbalanced data set. The class balance for the target variable is given below.
label    count
0.0      1,021,095
1.0      4,459
The two models I trained use upsampling and undersampling respectively. Below is the code I am using for each:
# 70/30 train/test split
train_initial, test = new_data.randomSplit([0.7, 0.3], seed=2018)
train_initial.groupby('label').count().toPandas()
test.groupby('label').count().toPandas()

# Sampling techniques --- apply only one of these

# Upsampling: replicate the minority class ~100x by sampling with replacement
df_class_0 = train_initial[train_initial['label'] == 0]
df_class_1 = train_initial[train_initial['label'] == 1]
df_class_1_over = df_class_1.sample(True, 100.0, seed=99)
train_up = df_class_0.union(df_class_1_over)
train_up.groupby('label').count().toPandas()

# Downsampling: keep all positives, sample the majority class down to a
# comparable size (~3091 positives vs ~714,840 negatives in the train split);
# fraction keys are floats to match the double-typed label values
stratified_train = train_initial.sampleBy(
    'label', fractions={0.0: 3091. / 714840, 1.0: 1.0}, seed=99).cache()
stratified_train.groupby('label').count().toPandas()
Below is how I am training the model:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorIndexer, IndexToString
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Index the label and the categorical features
labelIndexer = StringIndexer(inputCol='label',
                             outputCol='indexedLabel').fit(new_data)
featureIndexer = VectorIndexer(inputCol='features',
                               outputCol='indexedFeatures',
                               maxCategories=2).fit(new_data)

rf_model = RandomForestClassifier(labelCol='indexedLabel',
                                  featuresCol='indexedFeatures')

# Convert indexed predictions back to the original labels
labelConverter = IndexToString(inputCol='prediction',
                               outputCol='predictedLabel',
                               labels=labelIndexer.labels)

# Chain indexers and forest in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, rf_model, labelConverter])

# Search random forest hyperparameters for the best model
paramGrid = ParamGridBuilder() \
    .addGrid(rf_model.numTrees, [200, 400, 600, 800, 1000]) \
    .addGrid(rf_model.impurity, ['entropy', 'gini']) \
    .addGrid(rf_model.maxDepth, [2, 3, 4, 5]) \
    .build()

# Set up 5-fold cross-validation
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=5)

# Fit on the upsampled set (or on stratified_train for the undersampled run)
train_model = crossval.fit(train_up)   # or: crossval.fit(stratified_train)
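For reference, the precision/recall/AUROC/F1 numbers reported below can be computed from the fitted model's predictions along these lines (a sketch of one way to do it in PySpark; the exact code may differ slightly):

from pyspark.mllib.evaluation import MulticlassMetrics

predictions = train_model.transform(test)

# AUROC from the ML evaluator (uses the rawPrediction column by default)
auroc = BinaryClassificationEvaluator(labelCol='indexedLabel').evaluate(predictions)

# Per-class precision/recall/F1 via the RDD-based MulticlassMetrics
pred_and_label = predictions.select('prediction', 'indexedLabel') \
    .rdd.map(lambda r: (float(r[0]), float(r[1])))
metrics = MulticlassMetrics(pred_and_label)
print('precision:', metrics.precision(1.0))
print('recall:', metrics.recall(1.0))
print('f1 :', metrics.fMeasure(1.0))
print('Test Error =', 1.0 - metrics.accuracy)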
Below are the results from both methods.
#UpSampling - Training
Train Error = 0.184633
precision: 0.8565508112679312
recall: 0.6597217024736883
auroc: 0.9062348758176568
f1 : 0.7453609484359377
#UpSampling - Test
Test Error = 0.0781619
precision: 0.054455645977569946
recall: 0.6503868471953579
auroc: 0.8982212236597943
f1 : 0.10049688048716704
#UnderSampling - Training
Train Error = 0.179293
precision: 0.8468290542023261
recall: 0.781807131280389
auroc: 0.9129391668636556
f1 : 0.8130201200884863
#UnderSampling - Test
Test Error = 0.147874
precision: 0.034453223699706645
recall: 0.778046421663443
auroc: 0.8989720777537427
f1 : 0.06598453935901905
Referring to various articles on Stack Overflow, I understand that if the test error is lower than the train error, there is likely an error in the implementation. However, I am not quite sure where I am going wrong in training my models. Also, which sampling method is better to use with such a highly imbalanced class? If I undersample, I am worried there will be a loss of information.
I was hoping someone could help me out with this model and clear up my doubts.
Thanks a lot in advance!
Upvotes: 1
Views: 2388
Reputation: 3652
Testing error lower than training error does not necessarily mean an error in implementation. You can increase the number of iterations for training the model, and depending on your dataset the training error may become lower than the testing error; however, you may end up overfitting. Therefore the goal should also be to check other performance metrics on the test set, such as accuracy, precision, and recall.
Oversampling and undersampling are opposite but roughly equivalent techniques. If you have lots of data points, it is better to undersample; otherwise go for oversampling. SMOTE is a great technique for oversampling: it creates synthetic data points instead of repeating the same data points multiple times.
https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html
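A minimal sketch of SMOTE with imbalanced-learn (note it runs on the driver, so this assumes your training split fits in memory as a pandas frame with plain numeric feature columns, i.e. before they are assembled into a vector; train_flat below is that hypothetical flat version of your train split):

import pandas as pd
from imblearn.over_sampling import SMOTE

train_pd = train_flat.toPandas()       # hypothetical flat train split, collected from Spark
X = train_pd.drop(columns=['label'])
y = train_pd['label']

sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X, y)   # synthesizes new minority-class points
print(pd.Series(y_res).value_counts())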
Another tip: shuffle the data with different seeds and see whether the training error stays greater than the testing error, as in the sketch below. I suspect the variance in your data is high; read about the bias-variance trade-off.
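For example (a sketch that fits a single pipeline per seed instead of the full grid search to keep it cheap; your resampling step is omitted here for brevity):

evaluator = BinaryClassificationEvaluator(labelCol='indexedLabel')
for seed in [7, 21, 2018, 99]:
    tr, te = new_data.randomSplit([0.7, 0.3], seed=seed)
    model = pipeline.fit(tr)           # reuses your pipeline definition
    print(seed,
          'train auroc:', evaluator.evaluate(model.transform(tr)),
          'test auroc:', evaluator.evaluate(model.transform(te)))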
Judging by the results, it seems you have built a pretty decent model. Try XGBoost as well and compare the results with Random Forest.
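A minimal single-machine sketch with the xgboost package (X_train/y_train/X_test are hypothetical in-memory versions of your splits; scale_pos_weight is XGBoost's built-in way to upweight the rare class, so you may not even need resampling):

from xgboost import XGBClassifier

ratio = 1021095.0 / 4459.0        # negatives-to-positives ratio from your class counts

clf = XGBClassifier(n_estimators=400,
                    max_depth=4,
                    scale_pos_weight=ratio)   # upweights the positive class
clf.fit(X_train, y_train)                     # hypothetical in-memory arrays
probs = clf.predict_proba(X_test)[:, 1]       # scores for ranking / AUROC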
Upvotes: 2