RMishra

Reputation: 1

Why does shuffling the rows of the dataset drop the MLPClassifier accuracy so much?

The task is to classify text into labelled classes (class labels are 1, 2, ..., 40).

For this task, the text is first encoded with a sentence transformer, and then an MLPClassifier is used for classification.

The accuracy score is 0.02 when the dataset is shuffled and 0.52 when it is not shuffled.

The relevant code snippets are as follows:

Step-1: Loading and extracting the text and class labels from the train and test CSV files.

#load training data from csv
trainData = pd.read_csv("train.csv")
trainData = trainData.sample(frac=1) # <-- Shuffle the training data rows (It reduces the accuracy score from 0.52 to 0.02 ??)

#Load test data from csv
testData = pd.read_csv("test.csv")
testData = testData.sample(frac=1)  # <---- Shuffle the testing data rows (It reduces the accuracy score from 0.52 to 0.02 ??)

# Extracting the text and label from train data
X_train_text = trainData["text"]
y_train = trainData['label']

# Extracting the text and label from test data
X_test_text = testData['text']
y_test = testData['label']
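
Both CSV files have a "text" column and a "label" column; a quick check of what gets loaded (the values in the comments are what I expect, not actual output):

print(trainData.columns.tolist())    # should include ['text', 'label']
print(trainData['label'].nunique())  # 40 classes (labels 1..40)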

When I shuffle the train and test data in the above code snippet, the accuracy score decreases from 0.52 to 0.02:

trainData = trainData.sample(frac=1)  # <-- Shuffle the training data rows (It reduces the accuracy score from 0.52 to 0.02 ??)
testData = testData.sample(frac=1)  # <---- Shuffle the testing data rows (It reduces the accuracy score from 0.52 to 0.02 ??)
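
As far as I understand, sample(frac=1) only reorders the rows and keeps the original index labels (a minimal check on a toy DataFrame, made-up rows for illustration only):

import pandas as pd

# Toy frame standing in for train.csv
df = pd.DataFrame({"text": ["a", "b", "c"], "label": [1, 2, 3]})
shuffled = df.sample(frac=1)

print(shuffled)                         # same rows in random order, original index labels kept
print(shuffled.reset_index(drop=True))  # same rows with a fresh 0..n-1 index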

The code below is shared for a better understanding of the subsequent steps (if needed).

Step-2: Encoding the sentences using the sentence transformer.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-t5-base')

X_train_emb = model.encode(X_train_text)
X_test_emb = model.encode(X_test_text)
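
For reference, encode should return one embedding per input sentence, in input order (a quick sanity check; the 768-dimensional size is what I expect from sentence-t5-base):

print(type(X_train_emb), X_train_emb.shape)  # e.g. <class 'numpy.ndarray'> (n_train, 768)
print(type(X_test_emb), X_test_emb.shape)    # e.g. <class 'numpy.ndarray'> (n_test, 768)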

Step-3: The MLPClassifier is trained on the encoded text and the accuracy score is printed as follows:

# Initialise the MLP Classifier
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(hidden_layer_sizes=(64, 32), activation="tanh", learning_rate='adaptive', max_iter=1000)

# Train on the embedded text and print the accuracy on the test set
model_mlp = clf.fit(X_train_emb, y_train)
print(model_mlp.score(X_test_emb, y_test))
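
For what it's worth, the comparison between the shuffled and unshuffled runs can be made repeatable by fixing the seeds (a sketch only; the random_state values are arbitrary):

# Reproducible variant of the same setup (seed values chosen arbitrarily)
trainData = pd.read_csv("train.csv").sample(frac=1, random_state=42)
testData = pd.read_csv("test.csv").sample(frac=1, random_state=42)
clf = MLPClassifier(hidden_layer_sizes=(64, 32), activation="tanh",
                    learning_rate='adaptive', max_iter=1000, random_state=0)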

Why does the accuracy change so much, from 0.52 to 0.02, just by shuffling the dataset?

Upvotes: 0

Views: 59

Answers (0)
