Reputation: 107
I am creating a LightGBM model for prediction using Python. Initially, I split the data using sklearn.model_selection.train_test_split, which resulted in a lower mean absolute error (MAE). Later, I did the split another way, by slicing the dataframe into two separate dataframes, df_train and df_test (both approaches are sketched below). With that approach, the MAE is significantly higher than with the earlier one. Is the use of sklearn.model_selection.train_test_split mandatory with LightGBM, or can the data be split in any way? If it is not mandatory, the results should be somewhat similar; in my case, they are very different. Looking for suggestions/help.
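For reference, a minimal sketch of the two splitting approaches (the dataframe, column names, and the 80/20 ratio are placeholders standing in for my actual data):

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the real dataframe
df = pd.DataFrame(np.random.rand(100, 3), columns=["f1", "f2", "target"])
X, y = df[["f1", "f2"]], df["target"]

# Approach 1: shuffled random split via scikit-learn
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Approach 2: manual positional split, no shuffling
split_point = int(len(df) * 0.8)
df_train = df.iloc[:split_point]
df_test = df.iloc[split_point:]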
Upvotes: 0
Views: 429
Reputation: 6260
To always get the same outcome with sklearn.model_selection.train_test_split,
you have to fix the random_state:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

Based on the documentation:
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
Otherwise, you cannot reproduce the same result.
If you have the feeling the split does not fit your dataframe, you should use cross-validation: https://scikit-learn.org/stable/modules/cross_validation.html. That way you avoid over- and underfitting to one specific train/test split; a sketch follows below.
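A minimal sketch of cross-validated MAE with LightGBM (the LGBMRegressor settings, the placeholder data, and the 5-fold choice are illustrative assumptions, not from the question):

import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for the real dataframe
df = pd.DataFrame(np.random.rand(100, 3), columns=["f1", "f2", "target"])
X, y = df[["f1", "f2"]], df["target"]

model = lgb.LGBMRegressor(random_state=42)
# 5-fold cross-validation; scikit-learn reports negated MAE, so flip the sign back
mae_scores = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
print(mae_scores.mean(), mae_scores.std())

The spread of the per-fold scores gives a sense of how much the MAE depends on which rows end up in the test set, which is exactly the effect described in the question.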
Upvotes: 1