Alaike3

Reputation: 123

cross_val_predict is not completing. No error message

I'm trying to use KNeighborsClassifier on the MNIST example dataset.

When I try to use cross_val_predict, the script just keeps running no matter how long I leave it.

Is there something I am missing/doing wrong?

Any feedback is appreciated.

from sklearn.datasets import fetch_openml
import numpy as np

mnist = fetch_openml('mnist_784', version=1)  # Download the dataset into the notebook

X, y = mnist["data"], mnist["target"]

# The labels come back as strings, so cast them to integers for the model;
# casting the pixels to uint8 also keeps memory usage down.
y = y.astype(np.uint8)
X = X.astype(np.uint8)

X.shape, y.shape

X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]  # Split the data into training and test sets

from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_train)

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score

y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_train, cv=3)

f1_score(y_train, y_train_knn_pred, average="macro")

Upvotes: 1

Views: 243

Answers (2)

Kev1n91

Reputation: 3693

I think the confusion comes from the fact that KNN's fit call is so much faster than its predict call. From another SO post:

Why is cross_val_predict so much slower than fit for KNeighborsClassifier?

KNN is also called a lazy algorithm because during fitting it does nothing but save the input data; there is no learning at all.

The actual distance calculation happens during predict, once for each test data point. So when using cross_val_predict, KNN has to predict on every validation data point, which is what drives the computation time up!

Therefore, given the size of your input data, a lot of computational power is needed. Using multiple CPUs or reducing the dimensionality could help.
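For the dimensionality-reduction route, here is a minimal sketch (assuming the X_train and y_train from the question; the 95% variance threshold is an illustrative choice, not a tuned value):

from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

# Project the 784 raw pixels onto enough principal components to explain 95%
# of the variance; far fewer features make KNN's distance computations cheaper.
pca_knn = make_pipeline(PCA(n_components=0.95), KNeighborsClassifier())

# Fitting PCA inside the pipeline means each CV fold learns its projection
# from that fold's training data only.
y_train_knn_pred = cross_val_predict(pca_knn, X_train, y_train, cv=3)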

If you want to use multiple CPU cores, you can pass the n_jobs argument to both cross_val_predict and KNeighborsClassifier to set the number of cores used. Set it to -1 to use all available cores.
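For example, a minimal sketch reusing the X_train and y_train from the question:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_predict

# n_jobs=-1 parallelises both the neighbour search inside the classifier
# and the work across the three cross-validation folds.
knn_clf = KNeighborsClassifier(n_jobs=-1)
y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_train, cv=3, n_jobs=-1)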

Upvotes: 1

seralouk

Reputation: 33197

Use n_jobs=-1. From the documentation:

The number of CPUs to use to do the computation. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors

from sklearn.datasets import fetch_openml
import numpy as np

mnist = fetch_openml('mnist_784', version=1)  # Download the dataset into the notebook

X, y = mnist["data"], mnist["target"]

# Cast the string labels to integers and the pixels to uint8
y = y.astype(np.uint8)
X = X.astype(np.uint8)

X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]  # Split the data into training and test sets

from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier(n_jobs=-1) # HERE
knn_clf.fit(X_train, y_train) # this took seconds on my macbook pro

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score

y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_train, cv=3, n_jobs=-1) # AND HERE

f1_score(y_train, y_train_knn_pred, average="macro")

Upvotes: 2
