Reputation: 497
I am using the train_test_split function to separate data for training and testing, but the function seems to assign the wrong labels to the split data. Instead of the label from the expected row, it assigns the label from the row two positions away from the expected one. Please let me know where I am going wrong.
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv('To_Tanaji.csv')
print(data.columns)
print(data.shape)
#plt.hist(train["DiffCorrectLatRawLat"])
#test = pd.read_csv('test.csv')
#np.polyfit(data['DistanceRaw2GPS'], data['DistanceCorrected2GPS'], 2)
# Target (label) and feature columns
Output = data.DistanceCorrected2GPS
Input = data.DistanceRaw2GPS
# Random 80/20 split; rows are shuffled by default
X_train, X_test, y_train, y_test = train_test_split(Input, Output, test_size=0.2)
Upvotes: 0
Views: 4837
Reputation: 419
I wouldn't suggest turning off the shuffle parameter in your train_test_split call; rather, keep the random_state argument fixed for reproducible splits. A random split is better than taking, say, the top 20% of the dataset, which can skew your data.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(Input, Output, test_size = 0.20, random_state = 0)
If the split labels still look wrong, check that the Output and Input variables are assigned to the correct columns.
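One way to check whether the labels really are mismatched is to compare the pandas index of the split pieces. The sketch below uses a small made-up DataFrame as a stand-in for To_Tanaji.csv (the values are assumptions, only the column names come from the question). When pandas Series are passed in, train_test_split keeps the original row index, so the shuffled positions can look "wrong" even though each label still belongs to its row.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for To_Tanaji.csv; the values are invented for illustration
data = pd.DataFrame({
    "DistanceRaw2GPS": np.arange(10, dtype=float),
    "DistanceCorrected2GPS": np.arange(10, dtype=float) * 2.0,
})
Input = data["DistanceRaw2GPS"]
Output = data["DistanceCorrected2GPS"]

X_train, X_test, y_train, y_test = train_test_split(
    Input, Output, test_size=0.2, random_state=0
)

# Rows are shuffled, but features and labels carry the same original index
pairs_aligned = (
    X_train.index.equals(y_train.index) and X_test.index.equals(y_test.index)
)

# Look each label up in the original frame by index and compare with y_train
labels_match = (data.loc[X_train.index, "DistanceCorrected2GPS"] == y_train).all()

print(pairs_aligned, labels_match)
```

If both checks pass on your real data, the split is fine and the apparent offset comes from reading the shuffled rows by position instead of by index.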
Upvotes: 1
Reputation: 81
The train_test_split function will shuffle your data by default. If you don't want this, use shuffle=False.
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
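As a rough illustration of what shuffle=False does, here is a minimal sketch on made-up arrays (the data is an assumption, not the asker's file). Without shuffling, the first part of the data becomes the training set and the tail becomes the test set, in the original order.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration: each label is 10 times its feature
X = np.arange(10)
y = X * 10

# shuffle=False keeps the original order: the last 20% of rows form the test set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, shuffle=False)

print(X_te, y_te)
```

Here the test set is simply the last two rows, which makes it easy to compare against the source file by eye.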
If possible, provide your input data (scrambled or not) to reproduce the problem.
Upvotes: 0