Kamble Tanaji

Reputation: 497

scikit learn train_test_split function not working as expected

I am using the train_test_split function to separate data for training and testing, but the function seems to assign the wrong labels to the split data. Instead of the label from the expected row, each sample gets the label two rows away from it. Please let me know where I am going wrong.

import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv('To_Tanaji.csv')
print(data.columns)
print(data.shape)
#plt.hist(train["DiffCorrectLatRawLat"])
#test = pd.read_csv('test.csv')

#np.polyfit(data['DistanceRaw2GPS'], data['DistanceCorrected2GPS'], 2)
Output = data.DistanceCorrected2GPS
Input = data.DistanceRaw2GPS

X_train, X_test, y_train, y_test = train_test_split(Input, Output, test_size=0.2)

Upvotes: 0

Views: 4837

Answers (2)

Aditya Lahiri

Reputation: 419

I would not suggest turning off the shuffle parameter in train_test_split. Instead, keep random_state fixed so that the splits are reproducible. Splitting randomly is better than taking, say, the top 20% of the dataset, which can skew your data.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(Input, Output, test_size = 0.20, random_state = 0)

If the split labels still look wrong, check that the Output and Input variables are assigned correctly.
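One way to check this is sketched below. It assumes the data are pandas Series, as in the question, and uses a small synthetic frame in place of To_Tanaji.csv (the values are made up for illustration). Because the split keeps each row's original index, comparing the indices of X and y shows whether samples and labels are still paired, even though the rows come back in shuffled order.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for To_Tanaji.csv (hypothetical values, not the real data).
raw = np.arange(10, dtype=float)
data = pd.DataFrame({
    "DistanceRaw2GPS": raw,
    "DistanceCorrected2GPS": raw * 2.0,  # label is a known function of the input
})

Input = data.DistanceRaw2GPS
Output = data.DistanceCorrected2GPS

X_train, X_test, y_train, y_test = train_test_split(
    Input, Output, test_size=0.20, random_state=0
)

# The original row index is kept on every piece, so it can be used
# to confirm that each sample is still next to its own label.
pairs_aligned = (
    X_train.index.equals(y_train.index) and X_test.index.equals(y_test.index)
)

# Independent check using the known relation label = 2 * input.
labels_match = bool(
    (y_train == 2.0 * X_train).all() and (y_test == 2.0 * X_test).all()
)

# Repeating the call with the same random_state selects the same rows.
X_train2, X_test2, _, _ = train_test_split(
    Input, Output, test_size=0.20, random_state=0
)
reproducible = X_test.index.equals(X_test2.index)

print(pairs_aligned, labels_match, reproducible)
```

If the labels only look shifted when comparing X_train row by row against the original Output, the shuffled index is a likely explanation; look rows up by index label (for example with .loc) rather than by position.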

Upvotes: 1

Alano

Reputation: 81

The train_test_split function will shuffle your data by default. If you don't want this, use shuffle=False.

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
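A minimal sketch of the difference, using a made-up ten-row frame with the column names from the question (the values are hypothetical): with shuffle=False the rows keep their original order and the test set is simply the tail of the data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data; only the column names come from the question.
data = pd.DataFrame({
    "DistanceRaw2GPS": np.arange(10, dtype=float),
    "DistanceCorrected2GPS": np.arange(10, dtype=float) + 0.5,
})

X_train, X_test, y_train, y_test = train_test_split(
    data.DistanceRaw2GPS,
    data.DistanceCorrected2GPS,
    test_size=0.2,
    shuffle=False,  # keep the original row order
)

# Without shuffling, the first rows form the training set
# and the last rows form the test set.
train_rows = list(X_train.index)
test_rows = list(X_test.index)
print(train_rows, test_rows)
```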

If possible, provide your input data (scrambled or not) to reproduce the problem.

Upvotes: 0
