KNeighborsClassifier different scores without shuffle

Question

I have a dataset with 155 features and 40143 samples. It is sorted by date (oldest to newest); after sorting, I deleted the date column from the dataset.

The label is in the first column.

Cross-validation gives about 65% (mean accuracy across folds, +/- 0.01) with the code below:

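from sklearn import preprocessing
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def cross(dataset):
    # Features: every column except the label column "result".
    dropz = ["result"]
    X = dataset.drop(dropz, axis=1)
    # normalize() rescales each row (sample) to unit norm.
    X = preprocessing.normalize(X)
    y = dataset["result"]

    clf = KNeighborsClassifier(n_neighbors=1, weights='distance', n_jobs=-1)

    # 10-fold cross-validation; returns one accuracy value per fold.
    scores = cross_val_score(clf, X, y, cv=10, scoring='accuracy')
    return scores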
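
For reference, when `cv` is an integer and the estimator is a classifier, `cross_val_score` should fall back to `StratifiedKFold` without shuffling, so `cv=10` above is not a shuffled split. A minimal sketch that makes the splitter explicit; `X_toy` and `y_toy` are hypothetical random data, not the dataset from this question:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data (an assumption for illustration only).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = (X_toy[:, 0] > 0).astype(int)

clf = KNeighborsClassifier(n_neighbors=1)

# Integer cv: the splitter is chosen implicitly.
implicit = cross_val_score(clf, X_toy, y_toy, cv=10, scoring="accuracy")

# Same evaluation with the splitter spelled out (shuffle defaults to False).
explicit = cross_val_score(
    clf, X_toy, y_toy, cv=StratifiedKFold(n_splits=10), scoring="accuracy"
)

print(np.allclose(implicit, explicit))
```

Passing `StratifiedKFold(n_splits=10, shuffle=True, random_state=...)` instead would give the shuffled variant for comparison.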

I also get similar accuracy with the code below:

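from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def train(dataset):
    dropz = ["result"]
    X = dataset.drop(dropz, axis=1)
    X = preprocessing.normalize(X)
    y = dataset["result"]

    # Hold out 1000 rows; train_test_split shuffles the rows by default.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1000, random_state=42)

    clf = KNeighborsClassifier(n_neighbors=1, weights='distance', n_jobs=-1).fit(X_train, y_train)
    # Accuracy on the held-out rows.
    return clf.score(X_test, y_test)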

But if I don't shuffle the data in the code below, the accuracy is about 49%. If I shuffle, it is about 65%.

I should mention that I tried every consecutive block of 1000 samples, from the end of the set to the beginning, and the result is the same.

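import pandas as pd
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier
from sklearn.utils import shuffle  # assumed source of shuffle()

dataset = pd.read_csv("./dataset.csv", header=0, sep=";")

# With this line: about 65%. Without it: about 49%.
dataset = shuffle(dataset)

# Train on everything except the last 1000 rows; the label is column 0.
X_train = dataset.iloc[:-1000, 1:]
X_train = preprocessing.normalize(X_train)
y_train = dataset.iloc[:-1000, 0]

# Test on the last 1000 rows.
X_test = dataset.iloc[-1000:, 1:]
X_test = preprocessing.normalize(X_test)
y_test = dataset.iloc[-1000:, 0]

clf = KNeighborsClassifier(n_neighbors=1, weights='distance', n_jobs=-1).fit(X_train, y_train)
clf.score(X_test, y_test)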
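
The two setups above can be compared side by side with `train_test_split`, which shuffles by default and accepts `shuffle=False` to keep the last rows as the test set. The sketch below is only an illustration on hypothetical toy data (`X_toy`, `y_toy`, and the helper `holdout_accuracy` are made up here): one feature that tracks row order and a label that flips every 50 rows. It shows one way a 1-NN score can depend on shuffling; whether the real dataset behaves like this is an assumption, not something the post confirms.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def holdout_accuracy(X, y, test_size, shuffle, random_state=None):
    """Fit 1-NN on a holdout split and return accuracy on the held-out rows."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, shuffle=shuffle, random_state=random_state
    )
    clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
    return clf.score(X_test, y_test)


# Hypothetical toy data: the single feature is the row's position in time,
# and the label flips every 50 rows.
t = np.arange(1000)
X_toy = t.reshape(-1, 1).astype(float)
y_toy = (t // 50) % 2

# shuffle=False: the last 100 rows are the test set, as in the manual iloc split.
chrono_acc = holdout_accuracy(X_toy, y_toy, test_size=100, shuffle=False)

# shuffle=True: test rows are drawn at random, so each one usually has
# an adjacent-in-time row in the training set.
shuffled_acc = holdout_accuracy(
    X_toy, y_toy, test_size=100, shuffle=True, random_state=42
)

print(chrono_acc, shuffled_acc)
```

On this toy data the unshuffled split can only match test rows to the most recent training row, while the shuffled split finds near-duplicates in time. Running the same helper on the real features and labels would show whether the gap between 49% and 65% appears there too.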
