graus

Reputation: 57

scikit-learn RandomForestClassifier produces 'unexpected' results

I'm trying to use scikit-learn's RandomForestClassifier for a binary classification task (positive and negative examples). My training data contains 1,177,245 examples with 40 features, in SVM-light format (sparse vectors), which I load using sklearn.datasets' load_svmlight_file. It produces a sparse matrix of feature values (1,177,245 × 40) and one array of target classes (1,177,245 entries of 1s and 0s). I don't know whether this is worrisome, but the training data has 3,552 positives and the rest are all negative.

As this version's RandomForestClassifier doesn't accept sparse matrices, I convert the sparse matrix to a dense array (if I'm saying that right: lots of 0s for absent features) using .toarray(). I print the matrix before and after converting, and that seems to be going all right.
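In code, the loading and densifying steps look roughly like this (the file name is hypothetical):

    from sklearn.datasets import load_svmlight_file

    # load the sparse feature matrix and the target array
    X_sparse, y = load_svmlight_file("train.svmlight")
    print(X_sparse.shape)   # expected: (1177245, 40)
    print(int(y.sum()))     # number of positive labels, e.g. 3552

    # densify: absent features become explicit zeros
    X = X_sparse.toarray()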

When I initialize the classifier and start fitting it to the data, it takes this long:

[Parallel(n_jobs=40)]: Done   1 out of  40 | elapsed: 24.7min remaining: 963.3min
[Parallel(n_jobs=40)]: Done  40 out of  40 | elapsed: 27.2min finished

(Is that output right? Those 963 "remaining" minutes pass in about two and a half...)
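For reference, a minimal sketch of the initialization and fit, using only the parameters visible in the classifier dump below (and the X, y from the loading sketch above):

    from sklearn.ensemble import RandomForestClassifier

    # parameter values taken from the repr quoted below
    clf = RandomForestClassifier(n_estimators=1500, n_jobs=40, verbose=1)
    clf.fit(X, y)   # emits the [Parallel(n_jobs=40)] progress lines above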

I then dump it using joblib.dump. When I re-load it:

RandomForestClassifier: RandomForestClassifier(bootstrap=True, compute_importances=True,
        criterion=gini, max_depth=None, max_features=auto,
        min_density=0.1, min_samples_leaf=1, min_samples_split=1,
        n_estimators=1500, n_jobs=40, oob_score=False,
        random_state=<mtrand.RandomState object at 0x2b2d076fa300>,
        verbose=1)
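The dump/reload step is roughly the following (the path is hypothetical; in this scikit-learn version joblib ships under sklearn.externals):

    from sklearn.externals import joblib

    joblib.dump(clf, "rfc.joblib")
    clf = joblib.load("rfc.joblib")
    print(clf)   # prints the repr shown above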

When I then test it on real test data (750,709 examples, exact same format as the training data), I get "unexpected" results. To be exact: only one of the examples in the test data is classified as positive. When I train on half the initial training data and test on the other half, I get no positives at all.
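The test step, roughly (file name hypothetical; passing n_features=40 keeps the test matrix aligned with the training feature space):

    # load the test set and count predicted positives
    X_test_sparse, y_test = load_svmlight_file("test.svmlight", n_features=40)
    predictions = clf.predict(X_test_sparse.toarray())
    print(int(predictions.sum()))   # number of examples predicted positive; here, 1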

Now, I have no concrete reason to believe anything is wrong; it's just that I get weird results, and it all finishes awfully quickly. It's probably impossible to make a fair comparison, but training a random forest classifier on the same data using rt-rank (also with 1500 iterations, but with half the cores) takes over 12 hours...

Can anyone enlighten me as to whether I have any reason to believe something is not working the way it's supposed to? Could it be the ratio of positives to negatives in the training data? Cheers.

Upvotes: 1

Views: 2703

Answers (1)

ogrisel

Reputation: 40159

Indeed, this dataset is very, very imbalanced. I would advise you to subsample the negative examples (e.g. pick n_positive_samples of them at random) or to oversample the positive examples (the latter is more expensive but might yield better models).
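For instance, a minimal subsampling sketch, assuming X and y as in the question (the seed is arbitrary):

    import numpy as np

    pos_idx = np.where(y == 1)[0]
    neg_idx = np.where(y == 0)[0]

    # pick as many negatives as there are positives, without replacement
    rng = np.random.RandomState(42)
    neg_sample = rng.choice(neg_idx, size=len(pos_idx), replace=False)

    balanced_idx = np.concatenate([pos_idx, neg_sample])
    X_balanced, y_balanced = X[balanced_idx], y[balanced_idx]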

Also, are you sure that all your features are numerical (i.e. larger values mean something in real life)? If some of them are categorical integer markers, those features should be expanded into one-of-K boolean encodings instead, as scikit-learn's random forest implementation cannot deal with categorical data directly.
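One way to do the one-of-K expansion is with OneHotEncoder; in this sketch, column 0 is a hypothetical categorical integer feature:

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    # encode the categorical column as one-of-K booleans
    enc = OneHotEncoder()
    cat_encoded = enc.fit_transform(X[:, [0]]).toarray()

    # keep the remaining (truly numerical) columns as-is
    X_expanded = np.hstack([cat_encoded, X[:, 1:]])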

Upvotes: 4
