Reputation: 633
I have a list of files and I want to split it into 3 parts: training, validation and testing. I have tried this code, but I don't know whether it is correct.
files = glob.glob("/dataset/%s/*" % emotion)
training = files[:int(len(files)*0.8)] #get first 80% of file list
validation = files[-int(len(files)*0.1):] #get middle 10% of file list
testing = files[-int(len(files)*0.1):] #get last 10% of file list
I am not sure whether the testing list is a duplicate of the validation list or is correctly the last 10% of the file list.
Upvotes: 9
Views: 17480
Reputation: 5301
Same as zipa's answer, but with a self-contained example:
# Split a list of files into two (train, test) or three (train, val, test) parts
import numpy as np

def split_two(lst, ratio=(0.5, 0.5)):
    assert np.isclose(sum(ratio), 1.0)  # ratios must sum to 1 (tolerating float rounding)
    train_ratio = ratio[0]
    # np.split only needs the single cut index; everything after it forms the second part
    split_indices = [int(len(lst) * train_ratio)]
    train, test = np.split(lst, split_indices)
    return train, test

def split_three(lst, ratio=(0.8, 0.1, 0.1)):
    train_r, val_r, test_r = ratio
    assert np.isclose(sum(ratio), 1.0)  # ratios must sum to 1 (tolerating float rounding)
    # only the first two cut indices are needed; the last part is whatever remains (possibly empty)
    split_indices = [int(len(lst) * train_r), int(len(lst) * (train_r + val_r))]
    train, val, test = np.split(lst, split_indices)
    return train, val, test

files = list(range(10))
train, test = split_two(files)
print(train, test)
train, val, test = split_three(files)
print(train, val, test)
output:
[0 1 2 3 4] [5 6 7 8 9]
[0 1 2 3 4 5 6 7] [8] [9]
Upvotes: 6
Reputation: 20516
Is testing a duplicate of validation? Yes. You create them in exactly the same way: both slices extract the last 10 percent of the list:
files = [1,2,3,4,5,6,7,8,9,10]
training = files[:int(len(files)*0.8)] #[1, 2, 3, 4, 5, 6, 7, 8]
validation = files[-int(len(files)*0.1):] #[10]
testing = files[-int(len(files)*0.1):] #[10]
If you want to stick with your original approach, I suggest something like this (although the np.split method is more elegant):
files = [1,2,3,4,5,6,7,8,9,10]
training = files[:int(len(files)*0.8)] #[1, 2, 3, 4, 5, 6, 7, 8]
validation = files[int(len(files)*0.8):int(len(files)*0.9)] #[9]
testing = files[int(len(files)*0.9):] #[10]
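The same slicing can be wrapped in a small helper so the cut points are computed once and reused. This is only a sketch: the name split_by_ratio and its parameters are made up for illustration. It also avoids a pitfall of negative slicing: when int(len(files)*0.1) is 0 (fewer than 10 files), files[-0:] is the same as files[0:] and returns the whole list.

```python
def split_by_ratio(items, train_ratio=0.8, val_ratio=0.1):
    """Split items into three consecutive, non-overlapping slices.

    Hypothetical helper: whatever is left after the train and
    validation parts becomes the test part.
    """
    n = len(items)
    train_end = int(n * train_ratio)               # end of the training slice
    val_end = int(n * (train_ratio + val_ratio))   # end of the validation slice
    return items[:train_end], items[train_end:val_end], items[val_end:]


files = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
training, validation, testing = split_by_ratio(files)
print(training, validation, testing)

# The pitfall with negative slicing on a short list:
short = [1, 2, 3]
whole_list_again = short[-int(len(short) * 0.1):]  # int(0.3) == 0, so this is short[0:]
```

Because each slice starts where the previous one ends, the three parts always cover the list exactly once, whatever its length.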
Upvotes: 12
Reputation: 27869
You can take advantage of NumPy's np.split, which cuts the list at the given indices:
import numpy as np
train, validate, test = np.split(files, [int(len(files)*0.8), int(len(files)*0.9)])
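One thing to keep in mind is that np.split returns NumPy arrays, not Python lists. A minimal sketch, assuming ten made-up file names (img_0.png and so on are placeholders, not real paths), that converts the parts back to plain lists:

```python
import numpy as np

# Hypothetical file names standing in for the result of glob.glob(...)
files = [f"img_{i}.png" for i in range(10)]

n = len(files)
parts = np.split(files, [int(n * 0.8), int(n * 0.9)])
part_type = type(parts[0]).__name__  # each part is a numpy.ndarray

# Convert back to ordinary Python lists of str if the rest of the code expects lists
train, validate, test = (part.tolist() for part in parts)
print(len(train), len(validate), len(test))
```

Whether the conversion is needed depends on what consumes the lists; code that only iterates over the file names works with the arrays directly.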
Upvotes: 23