Reputation: 85
I have a dataset with 155 features. 40143 samples. It is sorted by date (oldest to newest) then I deleted the date column from the dataset.
label is on the first column.
CV results c. %65 (mean accuracy of scores +/- 0.01) with the code below:
def cross(dataset):
dropz = ["result"]
X = dataset.drop(dropz, axis=1)
X = preprocessing.normalize(X)
y = dataset["result"]
clf = KNeighborsClassifier(n_neighbors=1, weights='distance', n_jobs=-1)
scores = cross_val_score(clf, X, y, cv=10, scoring='accuracy')
Also I get similar accuracy with the code below:
def train(dataset):
dropz = ["result"]
X = dataset.drop(dropz, axis=1)
X = preprocessing.normalize(X)
y = dataset["result"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1000, random_state=42)
clf = KNeighborsClassifier(n_neighbors=1, weights='distance', n_jobs=-1).fit(X_train, y_train)
clf.score(X_test, y_test)
But If I don't use shuffle in the code below it results c. %49 If I use shuffle then it results c. %65
I should mention that I try every 1000 consecutive split of all set from end to beginning and the result is same.
dataset = pd.read_csv("./dataset.csv", header=0,sep=";")
dataset = shuffle(dataset) #!!!???
X_train = dataset.iloc[:-1000,1:]
X_train = preprocessing.normalize(X_train)
y_train = dataset.iloc[:-1000,0]
X_test = dataset.iloc[-1000:,1:]
X_test = preprocessing.normalize(X_test)
y_test = dataset.iloc[-1000:,0]
clf = KNeighborsClassifier(n_neighbors=1, weights='distance', n_jobs=-1).fit(X_train, y_train)
clf.score(X_test, y_test)
Upvotes: 0
Views: 157
Reputation: 1972
Assuming your question is "Why does it happen":
In both your first and second code snippets you have underlying shuffling happening (in your cross validation and your train_test_split methods), therefore they are equivalent (both in score and algorithm) to your last snippet with shuffling "on".
Since your original dataset is ordered by date there might be (and usually likely) some data that changes over time, which means that since your classifier never sees data from the last 1000 time points - it is unaware of the change in the underlying distribution and therefore fails to classify it.
Addendum to answer further data in comment:
This suggests that there might be some indicative process that is captured in smaller time frames. Two interesting ways to explore it are:
Upvotes: 1