Reputation: 1080
I am trying to craft a custom scorer function for cross-validating my (binary classification) model in scikit-learn (Python).
Some examples of my raw test data:
Source  Feature1  Feature2  Feature3
123     0.1       0.2       0.3
123     0.4       0.5       0.6
456     0.7       0.8       0.9
Assume that any fold might contain multiple test examples that come from the same source. For the set of examples sharing a source, I want my custom scorer to "decide" that the "winner" is the example to which the model assigned the highest probability. In other words, there can be only one correct prediction per source; if my model claims that more than one evaluation example was "correct" (label=1), I want only the highest-probability example matched against the truth by my scorer.
My problem is that the scorer function requires the signature:
score_func(y_true, y_pred, **kwargs)
where y_true and y_pred contain only the label/probability.
However, what I really need is:
score_func(y_true_with_source, y_pred_with_source, **kwargs)
so I can group the y_pred_with_source examples by their source and choose the winner to match against that of the y_true_with_source truth. Then I can carry on to calculate my precision, for example.
Is there a way I can pass in this information in some way? Maybe the examples' indices?
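To make the grouping concrete, here is a minimal sketch of the metric I have in mind (the function name and the separate sources array are my own invention; in reality the problem is that I cannot get sources into the scorer):

```python
import numpy as np

def grouped_precision(y_true, y_prob, sources):
    """Hypothetical metric: for each source, treat the example with the
    highest predicted probability as the single positive prediction, and
    return the fraction of sources whose winner truly has label 1."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    sources = np.asarray(sources)
    hits = []
    for s in np.unique(sources):
        idx = np.flatnonzero(sources == s)
        winner = idx[np.argmax(y_prob[idx])]  # highest-probability example
        hits.append(y_true[winner] == 1)
    return float(np.mean(hits))
```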
Upvotes: 3
Views: 867
Reputation: 3316
It sounds like you have a learning-to-rank problem here. You are trying to find the highest-ranked instance out of each group of instances. Learning-to-rank isn't directly supported in scikit-learn right now - scikit-learn pretty much assumes i.i.d. instances - so you'll have to do some extra work.
I think my first suggestion is to drop down a level in the API and use the cross-validation iterators. Those just generate indices for the training and validation folds. You would subset your data with those indices, call fit and predict on the subsets with Source removed, and then score using the Source column.
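A minimal sketch of that approach, with made-up toy data and LogisticRegression standing in for your model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Made-up toy data: column 0 is Source, the rest are features.
data = np.array([
    [123, 0.1, 0.2, 0.3],
    [123, 0.4, 0.5, 0.6],
    [456, 0.7, 0.8, 0.9],
    [456, 0.2, 0.1, 0.4],
    [789, 0.5, 0.9, 0.2],
    [789, 0.3, 0.3, 0.3],
])
y = np.array([1, 0, 1, 0, 1, 0])
source, X = data[:, 0], data[:, 1:]

scores = []
for train_idx, test_idx in KFold(n_splits=3).split(X):
    # fit and predict on the subsets, with Source removed...
    clf = LogisticRegression().fit(X[train_idx], y[train_idx])
    prob = clf.predict_proba(X[test_idx])[:, 1]
    # ...then score using the Source column of the test subset
    src, truth = source[test_idx], y[test_idx]
    hits = []
    for s in np.unique(src):
        idx = np.flatnonzero(src == s)
        hits.append(truth[idx[np.argmax(prob[idx])]] == 1)
    scores.append(float(np.mean(hits)))
```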
You can probably hack it into the cross_val_score approach, but it's trickier. In scikit-learn there is a distinction between the score function, which is what you showed above, and the scoring object (which can be a function) taken by cross_val_score. The scoring object is a callable with the signature scorer(estimator, X, y). It looks to me like you can define a scoring object that works for your metric. You just have to remove the Source column before sending data to the estimator, and then use that column when computing your metric. If you go this route, I think you will have to wrap the classifier too, so that its fit method skips the Source column.
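A sketch of that route (the wrapper class and scorer names are mine, and I'm assuming Source is the first column of X):

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

class DropSourceClassifier(BaseEstimator, ClassifierMixin):
    """Hypothetical wrapper: drops the Source column (assumed to be
    column 0 of X) before delegating to the wrapped estimator."""
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        self.estimator_ = clone(self.estimator).fit(X[:, 1:], y)
        return self

    def predict_proba(self, X):
        return self.estimator_.predict_proba(X[:, 1:])

def source_winner_scorer(estimator, X, y):
    """Scoring object with the scorer(estimator, X, y) signature:
    group rows by Source (column 0), call the row with the highest
    predicted probability the winner, and return the fraction of
    sources whose winner truly has label 1."""
    prob = estimator.predict_proba(X)[:, 1]
    src = X[:, 0]
    hits = []
    for s in np.unique(src):
        idx = np.flatnonzero(src == s)
        winner = idx[np.argmax(prob[idx])]
        hits.append(y[winner] == 1)
    return float(np.mean(hits))

# Toy data: column 0 is Source, the rest are features.
X = np.array([
    [123, 0.1, 0.2, 0.3],
    [123, 0.4, 0.5, 0.6],
    [456, 0.7, 0.8, 0.9],
    [456, 0.2, 0.1, 0.4],
    [789, 0.5, 0.9, 0.2],
    [789, 0.3, 0.3, 0.3],
])
y = np.array([1, 0, 1, 0, 1, 0])

scores = cross_val_score(DropSourceClassifier(LogisticRegression()),
                         X, y, scoring=source_winner_scorer, cv=3)
```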
Hope that helps... Good luck!
Upvotes: 1