eshaa

Reputation: 587

How training and test data is split - Keras on Tensorflow

I am currently training a neural network on my data using the fit function.

history = model.fit(X, encoded_Y, batch_size=50, epochs=500, validation_split=0.2, verbose=1)

I have set validation_split to 20%. My understanding is that 80% of the data will be used for training and 20% for testing. I am confused about how this is handled behind the scenes. Are the first 80% of the samples taken for training and the last 20% for testing, or are the samples picked at random? And if I want to supply separate training and testing data, how do I do that with model.fit()?

My second question is how to check whether the model fits the data well. Training accuracy is around 90% while validation accuracy is around 55%. Is this a case of overfitting or underfitting?

My last question is what evaluate returns. The documentation says it returns the loss, but I already get loss and accuracy for each epoch from fit() (in history). What do the score and accuracy returned by evaluate show? If evaluate returns an accuracy of 90%, can I say the model fits well, regardless of the accuracy and loss of the individual epochs?

Below is my code:

import numpy
import pandas
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from keras.utils import np_utils
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix
import itertools

seed = 7
numpy.random.seed(seed)

dataframe = pandas.read_csv("INPUTFILE.csv", skiprows=range(0, 0))

dataset = dataframe.values
X = dataset[:,0:50].astype(float) # number of cols-1
Y = dataset[:,50]

encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)

encoded_Y = np_utils.to_categorical(encoded_Y)
print("encoded_Y=", encoded_Y) 
# baseline model
def create_baseline():
    # create model
    model = Sequential()
    model.add(Dense(5, input_dim=50, kernel_initializer='normal', activation='relu'))  # input_dim must match the 50 feature columns in X
    model.add(Dense(5, kernel_initializer='normal', activation='relu'))
    #model.add(Dense(2, kernel_initializer='normal', activation='sigmoid'))

    model.add(Dense(2, kernel_initializer='normal', activation='softmax'))

    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])  # for binary classification
    #model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])  # for multi-class
    return model
    

model = create_baseline()
history = model.fit(X, encoded_Y, batch_size=50, epochs=500, validation_split=0.2, verbose=1)

# list all data in history
print(history.history.keys())
# summarize history for accuracy
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()


pre_cls=model.predict_classes(X)    
cm1 = confusion_matrix(encoder.transform(Y),pre_cls)
print('Confusion Matrix : \n')
print(cm1)


score, acc = model.evaluate(X,encoded_Y)
print('Test score:', score)
print('Test accuracy:', acc)

Upvotes: 27

Views: 94445

Answers (2)

cards

Reputation: 5033

No, Model.fit does not split the dataset into separate train and test sets that you can reuse. Model.fit trains the model (optionally holding out a fraction of the data internally via validation_split) and returns a History object. Its History.history attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).
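
To make the History record concrete, here is a minimal sketch that reads such a dictionary and looks for a widening gap between training and validation accuracy, which is the usual sign of overfitting. The numbers are made up for illustration, and the key names are an assumption: newer Keras versions use 'accuracy' and 'val_accuracy', while older ones (as in the question) use 'acc' and 'val_acc'.

```python
# Hypothetical contents of History.history after 5 epochs (values are invented).
history = {
    "accuracy":     [0.60, 0.72, 0.81, 0.87, 0.90],
    "val_accuracy": [0.58, 0.62, 0.60, 0.57, 0.55],
}

# Per-epoch gap between training and validation accuracy.
gaps = [round(t - v, 2) for t, v in zip(history["accuracy"], history["val_accuracy"])]

# Epoch (1-based) at which validation accuracy peaked.
best_epoch = history["val_accuracy"].index(max(history["val_accuracy"])) + 1

print(gaps)        # [0.02, 0.1, 0.21, 0.3, 0.35]
print(best_epoch)  # 2
```

A gap that keeps growing while validation accuracy falls after an early peak suggests that training longer is only memorizing the training set.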

Keras provides a dedicated function for splitting, keras.utils.split_dataset.

Here is a comparison of ways to split the dataset into train and test data:

Header:

import numpy as np
import sklearn.model_selection  # the submodule must be imported explicitly
import tensorflow as tf
import keras


# Sets all random seeds (Python, NumPy, and backend framework)
seed = 123456
keras.utils.set_random_seed(seed)

# shared variables
dataset = tf.data.Dataset.range(12, output_type=tf.int8)
TRAIN_SIZE: float = 0.25
DO_SHUFFLE: bool = False

Split data with Keras:

x_train, x_test = keras.utils.split_dataset(
    dataset,
    left_size=TRAIN_SIZE,
    shuffle=DO_SHUFFLE, # default False
    seed=seed
)

print(list(map(int, x_train.as_numpy_iterator())), list(map(int, x_test.as_numpy_iterator())))
#[0, 1, 2] [3, 4, 5, 6, 7, 8, 9, 10, 11]

Split data with scikit-learn:

keras.utils.set_random_seed(seed)
x_train, x_test = sklearn.model_selection.train_test_split(
    list(map(int, dataset.as_numpy_iterator())), # only Python or NumPy data
    train_size=TRAIN_SIZE,
    shuffle=DO_SHUFFLE, # default True
    random_state=seed
)
print(x_train, x_test)
#[0, 1, 2] [3, 4, 5, 6, 7, 8, 9, 10, 11]

Implicit split at training time with Keras:

model = keras.Sequential([keras.layers.Identity()])
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy']) # not important for the example!

history = model.fit(
    *[np.array(list(dataset.as_numpy_iterator()))]*2,
    shuffle=DO_SHUFFLE, # default True
    validation_split=1.-TRAIN_SIZE # the validation fraction is the complement of the training fraction
)
print(history.history)

Output

{
    'accuracy': [1.0], 'loss': [1.909542441368103], #         results of the training
    'val_accuracy': [1.0], 'val_loss': [133.94410705566406] # results of the validation
}

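Model.evaluate reports how the already-trained model performs on the data you pass to it, typically data that was not used for training. It does not update the model's weights. According to the documentation, it returns: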

Scalar test loss (if the model has a single output and no metrics) or list of scalars (if the model has multiple outputs and/or metrics). The attribute model.metrics_names will give you the display labels for the scalar outputs.
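
Which data you pass to evaluate matters. The question calls evaluate on the full X, which includes the 80% the model was trained on, so the reported accuracy mixes training and held-out rows. As a rough sketch using the question's approximate figures (90% training accuracy, 55% validation accuracy, validation_split=0.2), and assuming the per-epoch numbers carry over unchanged, the blended value can be estimated like this (blended_accuracy is a hypothetical helper, not a Keras function):

```python
def blended_accuracy(train_acc, val_acc, validation_split):
    """Sample-weighted accuracy over training rows and held-out rows combined."""
    return (1.0 - validation_split) * train_acc + validation_split * val_acc


# Approximate figures taken from the question.
overall = blended_accuracy(train_acc=0.90, val_acc=0.55, validation_split=0.2)
print(round(overall, 2))  # 0.83
```

A high score from evaluate on data that includes the training rows therefore says little about generalization; passing only held-out data gives the more honest number.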

Upvotes: 0

sebrojas

Reputation: 914

  1. The Keras documentation says: "The validation data is selected from the last samples in the x and y data provided, before shuffling." This means the split happens first and the shuffle afterwards: the last 20% of your rows become the validation set, and only the remaining training rows are shuffled. There is also a boolean parameter called "shuffle", which defaults to True; set it to False if you don't want your training data to be shuffled.

  2. Getting good results on your training data and poor results on your validation data usually means that your model is overfitting. Overfitting happens when the model learns the specifics of the training set and can't achieve good results on new data.

  3. Evaluation tests your model on new data that it has "never seen before". Usually you divide your data into training and test sets, but sometimes you might also want a third set. If you keep adjusting your model to obtain better and better results on your test data, that is a form of cheating: you are indirectly telling the model what the evaluation data looks like, and this can itself cause overfitting.
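
To illustrate the first point, here is a minimal pure-Python sketch of a "last samples" split. The helper name tail_validation_split is hypothetical, and the rounding only approximates what Keras does internally. It shows why this matters when the rows are sorted by class.

```python
import math
import random


def tail_validation_split(samples, validation_split):
    """Hold out the last fraction of the samples, with no shuffling beforehand."""
    split_at = math.floor(len(samples) * (1.0 - validation_split))
    return samples[:split_at], samples[split_at:]


# Ten labels sorted by class, as often happens when a CSV is grouped by label.
labels = [0] * 6 + [1] * 4
train, val = tail_validation_split(labels, validation_split=0.2)
print(train)  # [0, 0, 0, 0, 0, 0, 1, 1]
print(val)    # [1, 1]  (the validation set contains only class 1)

# Shuffling the rows yourself before the split mixes the classes.
shuffled = labels[:]
random.Random(7).shuffle(shuffled)
train_s, val_s = tail_validation_split(shuffled, validation_split=0.2)
```

If your file is ordered by label, shuffle X and Y together before calling fit, or the validation accuracy will be measured on an unrepresentative slice.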

Also, if you want to split your data without using Keras, I recommend the sklearn train_test_split() function.

It's easy to use and looks like this:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
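
For the three-set setup mentioned in point 3, one possible sketch is to call train_test_split twice. The toy data, the 60/20/20 proportions, and the random_state value are assumptions chosen for illustration; the commented lines at the end show how the explicit sets would be handed to an already compiled Keras model through the validation_data argument of fit.

```python
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with one feature each and a balanced binary label.
X = [[i] for i in range(100)]
y = [i % 2 for i in range(100)]

# First set aside 20% as the final test set, untouched while tuning.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y
)

# Then split the remainder into training (75%) and validation (25%),
# which gives 60/20/20 overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=7, stratify=y_rest
)
print(len(X_train), len(X_val), len(X_test))  # 60 20 20

# With a compiled Keras model (not built here) and the data as NumPy arrays:
# model.fit(X_train, y_train, validation_data=(X_val, y_val))
# model.evaluate(X_test, y_test)
```

When validation_data is given, fit uses it instead of validation_split, so you control exactly which rows are held out.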

Upvotes: 60
