gokhan k

Reputation: 85

KNeighborsClassifier different scores without shuffle

I have a dataset with 155 features and 40,143 samples. It is sorted by date (oldest to newest); after sorting I deleted the date column from the dataset.

The label is in the first column.

Cross-validation gives roughly 65% accuracy (mean of the fold scores, +/- 0.01) with the code below:

def cross(dataset):
    dropz = ["result"]
    X = dataset.drop(dropz, axis=1)
    X = preprocessing.normalize(X)
    y = dataset["result"]

    clf = KNeighborsClassifier(n_neighbors=1, weights='distance', n_jobs=-1)
    scores = cross_val_score(clf, X, y, cv=10, scoring='accuracy')
    return scores

I also get similar accuracy with the code below:

def train(dataset):
    dropz = ["result"]
    X = dataset.drop(dropz, axis=1)
    X = preprocessing.normalize(X)
    y = dataset["result"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1000, random_state=42)

    clf = KNeighborsClassifier(n_neighbors=1, weights='distance', n_jobs=-1).fit(X_train, y_train)
    return clf.score(X_test, y_test)

But if I don't shuffle in the code below, I get roughly 49%; if I do shuffle, I get roughly 65%.

I should mention that I tried every consecutive 1000-sample test window, from the end of the set back to the beginning, and the result is the same.

dataset = pd.read_csv("./dataset.csv", header=0, sep=";")

dataset = shuffle(dataset) #!!!???

X_train = dataset.iloc[:-1000,1:]
X_train = preprocessing.normalize(X_train)
y_train = dataset.iloc[:-1000,0]


X_test = dataset.iloc[-1000:,1:]
X_test = preprocessing.normalize(X_test)
y_test = dataset.iloc[-1000:,0]

clf = KNeighborsClassifier(n_neighbors=1, weights='distance', n_jobs=-1).fit(X_train, y_train)
clf.score(X_test, y_test)
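The sliding-window check I describe above could be sketched like this (the function name is mine, and training on all rows outside the test window is an assumption about the procedure, not code from my script):

```python
# Hedged sketch: score a fixed-size test window at every consecutive
# position from the end of the set back to the beginning, training on
# all remaining rows each time.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing

def sliding_window_scores(X, y, window=1000):
    X = np.asarray(X)
    y = np.asarray(y)
    n = len(X)
    scores = []
    for start in range(n - window, -1, -window):
        test = slice(start, start + window)
        mask = np.ones(n, dtype=bool)
        mask[test] = False  # train on everything outside the window
        X_train = preprocessing.normalize(X[mask])
        X_test = preprocessing.normalize(X[test])
        clf = KNeighborsClassifier(n_neighbors=1, weights='distance')
        clf.fit(X_train, y[mask])
        scores.append(clf.score(X_test, y[test]))
    return scores
```

Every window position gives me a similarly low score, which is why I suspect the ordering itself, not one unlucky split.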

Upvotes: 0

Views: 157

Answers (1)

ginge

Reputation: 1972

Assuming your question is "Why does it happen":

In both your first and second code snippets the data is effectively shuffled under the hood (by your cross-validation folding and by train_test_split), so they are equivalent, both in score and in algorithm, to your last snippet with shuffling turned on.

Since your original dataset is ordered by date, there is likely some data that changes over time. Because your classifier never sees data from the last 1000 time points, it is unaware of the change in the underlying distribution and therefore fails to classify it.
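A rough way to see this effect in isolation is to evaluate with scikit-learn's TimeSeriesSplit, which always tests on data that comes after its training data. The synthetic dataset below is purely illustrative (its feature distribution drifts with time); it is not meant to model your data:

```python
# Hedged sketch: evaluate a 1-NN classifier with time-ordered folds on
# synthetic data whose feature distribution drifts over time.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score, TimeSeriesSplit
from sklearn import preprocessing

rng = np.random.default_rng(0)
n = 2000
t = np.linspace(0.0, 1.0, n)
X = rng.normal(size=(n, 5)) + t[:, None]   # features drift upward over time
y = (X[:, 0] > t).astype(int)              # label rule also shifts with t

Xn = preprocessing.normalize(X)
clf = KNeighborsClassifier(n_neighbors=1, weights='distance')

# Each fold trains only on earlier samples and tests on later ones.
ordered = cross_val_score(clf, Xn, y, cv=TimeSeriesSplit(n_splits=5))
print("time-ordered fold accuracies:", np.round(ordered, 2))
```

If your time-ordered scores are consistently below your shuffled scores, that is the signature of drift rather than of a bad model.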


Addendum to answer further data in comment:

This suggests that there may be some indicative process that is only captured in smaller time frames. Two interesting ways to explore it:

  1. Reduce the size of the test set until you find a window size at which the difference between the shuffled and unshuffled results is negligible.
  2. Such a process essentially manifests as dependence between your features, so you could check whether, within a small time frame, your features are correlated with one another.
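Suggestion 1 could be sketched like this (the function name and the choice of window sizes are illustrative, not prescriptive):

```python
# Hedged sketch: for each candidate test-window size, compare the score
# of a time-ordered split (last `size` rows held out) against a shuffled
# split of the same size.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing
from sklearn.utils import shuffle

def compare_windows(X, y, sizes):
    X = np.asarray(X)
    y = np.asarray(y)
    results = {}
    for size in sizes:
        pair = {}
        for do_shuffle in (False, True):
            Xs, ys = (shuffle(X, y, random_state=42) if do_shuffle
                      else (X, y))
            X_train = preprocessing.normalize(Xs[:-size])
            X_test = preprocessing.normalize(Xs[-size:])
            clf = KNeighborsClassifier(n_neighbors=1, weights='distance')
            clf.fit(X_train, ys[:-size])
            key = "shuffled" if do_shuffle else "ordered"
            pair[key] = clf.score(X_test, ys[-size:])
        results[size] = pair
    return results
```

If the gap between "ordered" and "shuffled" shrinks as the window shrinks, the drift is gradual; if it persists even for tiny windows, the change is abrupt or the most recent data differs categorically.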

Upvotes: 1
