Reputation: 151
I am trying to tune hyperparameters with GridSearchCV in scikit-learn using a 'Precision at k' scoring metric, i.e. the precision I get when I label the top k fraction of samples, ranked by my classifier's score, as the positive class. I know I can build a custom scorer by writing a score function and wrapping it with make_scorer. This is what I have now:
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import LogisticRegression

def precision_at_k(y_true, y_score, k):
    df = pd.DataFrame({'true': y_true, 'score': y_score}).sort('score')
    threshold = df.iloc[int(k*len(df)),1]
    y_pred = pd.Series([1 if i >= threshold else 0 for i in df['score']])
    return metrics.precision_score(y_true, y_pred)

custom_scorer = metrics.make_scorer(precision_at_k, needs_proba=True, k=0.1)

# Random toy data: 100 samples, 10 features, roughly 30% positives
X = np.random.randn(100, 10)
Y = np.random.binomial(1, 0.3, 100)

# Simple 70/30 train/test split
train_index = range(0, 70)
test_index = range(70, 100)
train_x = X[train_index]
train_Y = Y[train_index]
test_x = X[test_index]
test_Y = Y[test_index]

clf = LogisticRegression()
params = {'C': [0.01, 0.1, 1, 10]}
clf_gs = GridSearchCV(clf, params, scoring=custom_scorer)
clf_gs.fit(train_x, train_Y)
However, calling fit gives me Exception: Data must be 1-dimensional, and I'm not sure why. Can anyone help? Thanks in advance.
Upvotes: 5
Views: 4712
Reputation: 187
Each column passed to pd.DataFrame has to be 1-dimensional, and a NumPy array with more than one dimension triggers this exception.
So try converting y_true and y_score to Python lists:
df = pd.DataFrame({'true': y_true.tolist(), 'score': y_score.tolist()}).sort('score')
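A likely cause, assuming the installed scikit-learn passes the full predict_proba output to the metric, is that y_score arrives with shape (n_samples, 2) rather than as a 1-D vector. The sketch below is one way to write the metric so it handles both shapes and skips the DataFrame entirely. It is not the asker's original function: the helper names precision_at_k_scorer and run_demo are hypothetical, the rounding rule for the top-k cut-off is an assumption, and the import from sklearn.model_selection assumes a scikit-learn release that provides that module.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV


def precision_at_k(y_true, y_score, k=0.1):
    """Precision when the top k fraction of samples by score is labelled positive."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    # If the full (n_samples, 2) predict_proba output was passed in,
    # keep only the positive-class column.
    if y_score.ndim == 2:
        y_score = y_score[:, 1]
    # Assumption: round k * n to the nearest integer, with at least one sample.
    n_top = max(1, int(round(k * len(y_score))))
    # Indices of the n_top highest scores.
    top = np.argsort(y_score)[::-1][:n_top]
    return float(np.mean(y_true[top] == 1))


def precision_at_k_scorer(estimator, X, y):
    """Callable scorer with the (estimator, X, y) signature GridSearchCV accepts."""
    return precision_at_k(y, estimator.predict_proba(X), k=0.1)


def run_demo(seed=0):
    """Grid search on random toy data shaped like the example in the question."""
    rng = np.random.RandomState(seed)
    X = rng.randn(100, 10)
    Y = rng.binomial(1, 0.3, 100)
    train_x, train_Y = X[:70], Y[:70]
    params = {'C': [0.01, 0.1, 1, 10]}
    clf_gs = GridSearchCV(LogisticRegression(), params,
                          scoring=precision_at_k_scorer)
    clf_gs.fit(train_x, train_Y)
    return clf_gs


if __name__ == "__main__":
    print(run_demo().best_params_)
```

Passing a plain callable as scoring sidesteps make_scorer, whose probability-related keyword arguments have changed across scikit-learn versions. Ranking with argsort also keeps y_true and the predictions aligned, which sorting a DataFrame and then comparing against the unsorted y_true does not.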
Upvotes: 2