Reputation: 5444
I have a list containing the six sub-datasets of a dataset, and I would like to perform 6-fold cross-validation. So, in a for loop of six steps, each time I split my dataset into two groups: a training set containing five sub-datasets, and a test set containing the left-out sub-dataset. My code looks like:
EDIT(by taking into account the comments):
sets = ['datasets/1.pickle', 'datasets/2.pickle', ..., 'datasets/6.pickle']
for i in range(6):
    train_set = sets[:i] + sets[i+1:]
    test_data, test_lbls = crossValidFiles(sets[i])  # returns the data for a specific sub-sample: two numpy arrays
    train_pairs = [crossValidFiles(item) for item in train_set]
    train_data = np.concatenate([a for (a, b) in train_pairs], axis=0)
    train_lbls = np.concatenate([b for (a, b) in train_pairs], axis=0)
    #train_data, train_lbls = crossValidFiles(item)  # that returns one file at a time
How can I aggregate the files that I return for the training set?
Upvotes: 0
Views: 256
Reputation: 323
As an alternative to Mason's answer, you can use np.concatenate inside your crossValidFiles function, so that whatever code is in there runs on the aggregated training data.
import numpy as np

def crossValidFiles(input_file):
    data, labels = some_load_function(input_file)
    return data, labels

def some_load_function(input_file):
    # Check whether the input is a single filename or a list of filenames
    if isinstance(input_file, str):
        train_array = some_load_function_2(input_file)
    else:
        train_array = np.concatenate([some_load_function_2(f) for f in input_file], axis=0)
    # rest of your code to create the variables 'data' and 'labels'
    return data, labels
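A minimal runnable sketch of the idea above. Note that `some_load_function_2` and the pickle filenames are placeholders from the answer; here the loader is replaced by a fabricated in-memory store so the snippet runs without any files on disk:

```python
import numpy as np

# Fabricated stand-in for some_load_function_2: each "file" is a small
# synthetic array keyed by name, so no pickles are needed.
_fake_store = {
    'datasets/1.pickle': np.array([[1, 2], [3, 4]]),
    'datasets/2.pickle': np.array([[5, 6]]),
}

def some_load_function_2(f):
    return _fake_store[f]

def load_any(input_file):
    # Accept either a single filename or a list of filenames.
    if isinstance(input_file, str):
        return some_load_function_2(input_file)
    return np.concatenate([some_load_function_2(f) for f in input_file], axis=0)

single = load_any('datasets/1.pickle')                            # shape (2, 2)
combined = load_any(['datasets/1.pickle', 'datasets/2.pickle'])   # shape (3, 2)
```

The point is that the caller never has to distinguish the single-file and multi-file cases; the loader handles the aggregation.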
Upvotes: 1
Reputation: 1924
You can use np.concatenate() (see the NumPy documentation for np.concatenate).
e.g.
import numpy as np
t1 = np.array([[1, 2, 3], [4, 5, 6]])
t2 = np.array([[7, 8, 9], [10, 11, 12]])
train_array = np.concatenate((t1, t2), axis=0)
To process your files, I would extract the train_data and train_lbls for each file, then just concatenate a list of each. e.g.:
import numpy as np
t1 = [np.array([[1,2,3],[4,5,6]]), np.array(['train_lbls'])]
t2 = [np.array([[7,8,9],[10,11,12]]), np.array(['train_lbls'])]
train_set = [t1,t2]
# with real files this would be: train_set = [crossValidFiles(f) for f in file_list]
train_data = np.concatenate([a for (a,b) in train_set], axis=0)
train_lbls = np.concatenate([b for (a,b) in train_set], axis=0)
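Putting the pieces together, the whole 6-fold loop can be sketched as below. `crossValidFiles` here is a fabricated stand-in that synthesizes small arrays (two rows of features and two labels per sub-dataset) so the loop is runnable as-is; in practice it would load your pickle files:

```python
import numpy as np

def crossValidFiles(name):
    # Hypothetical loader: fabricates (data, labels) for one sub-dataset.
    i = int(name)
    data = np.full((2, 3), i)   # two feature rows per sub-dataset
    lbls = np.array([i, i])     # one label per row
    return data, lbls

sets = ['1', '2', '3', '4', '5', '6']
for i in range(len(sets)):
    # Left-out sub-dataset becomes the test fold.
    test_data, test_lbls = crossValidFiles(sets[i])
    # Load the remaining five sub-datasets and stack them.
    train_pairs = [crossValidFiles(s) for s in sets[:i] + sets[i+1:]]
    train_data = np.concatenate([a for (a, b) in train_pairs], axis=0)
    train_lbls = np.concatenate([b for (a, b) in train_pairs], axis=0)
```

Each iteration, train_data has 5 sub-datasets × 2 rows = 10 rows, and test_data has the 2 rows of the left-out fold.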
Upvotes: 1