Jose Ramon

Reputation: 5444

Aggregate the data of multiple numpy files into one

I have a list containing 6 sub-datasets of a dataset, and I would like to perform 6-fold cross-validation. In a for-loop of 6 steps, I want to split the dataset into two groups each time: a training set containing 5 of the sub-datasets, and a test set containing the one sub-dataset that is left out. My code looks like this:

EDIT (taking the comments into account):

import numpy as np

sets = ['datasets/1.pickle', 'datasets/2.pickle', ..., 'datasets/6.pickle']
for i in range(len(sets)):  # range(0,7) would run past the last index
  train_files = sets[:i] + sets[i+1:]
  test_data, test_lbls = crossValidFiles(sets[i]) # returns the data for a specific sub-sample as two numpy arrays
  # load every training file, then stack data and labels along the first axis
  train_set = [crossValidFiles(item) for item in train_files]
  train_data = np.concatenate([a for (a,b) in train_set], axis = 0)
  train_lbls = np.concatenate([b for (a,b) in train_set], axis = 0)

How can I aggregate the files that I return for the training set?

Upvotes: 0

Views: 256

Answers (2)

ymzkala

Reputation: 323

As an alternative to Mason's answer, you can call np.concatenate inside your crossValidFiles function, so that whatever code is in there runs on the aggregated data. The loader then accepts either a single file name or a list of file names.

import numpy as np

def crossValidFiles(input_file):
    data, labels = some_load_function(input_file)
    return data, labels

def some_load_function(input_file):
    # Check if the input file is a string or list-like
    if isinstance(input_file, str):
        train_array = some_load_function_2(input_file)
    else:
        train_array = np.concatenate([some_load_function_2(f) for f in input_file], axis=0)

    # rest of your code to create variables 'data' and 'labels'
    return data, labels
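
To make the idea concrete, here is a minimal runnable sketch. It assumes each pickle file holds a (data, labels) tuple, which the question does not confirm, and load_one and load_many are hypothetical names used only for illustration.

```python
import pickle
import numpy as np

def load_one(path):
    # Hypothetical loader: assumes the pickle stores a (data, labels) tuple.
    with open(path, "rb") as fh:
        data, labels = pickle.load(fh)
    return np.asarray(data), np.asarray(labels)

def load_many(paths):
    # Accept a single path or a list of paths.
    if isinstance(paths, str):
        paths = [paths]
    pairs = [load_one(p) for p in paths]
    # Stack data and labels along the first (sample) axis.
    data = np.concatenate([d for d, _ in pairs], axis=0)
    labels = np.concatenate([l for _, l in pairs], axis=0)
    return data, labels
```

With this shape, the test fold can be loaded with load_many(one_path) and the training folds with load_many(list_of_paths), so the calling loop stays the same for both.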



Upvotes: 1

Mason Caiby

Reputation: 1924

You can use np.concatenate().

e.g.

import numpy as np
t1 = np.array([[1,2,3],[4,5,6]])
t2 = np.array([[7,8,9],[10,11,12]])
train_array = np.concatenate((t1,t2), axis=0)
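
As a quick check of what the axis argument does, this small sketch reuses the same two arrays and adds a pair of made-up label arrays (the label values are illustrative only).

```python
import numpy as np

t1 = np.array([[1, 2, 3], [4, 5, 6]])
t2 = np.array([[7, 8, 9], [10, 11, 12]])

# axis=0 appends rows (more samples); axis=1 appends columns (more features).
rows = np.concatenate((t1, t2), axis=0)
cols = np.concatenate((t1, t2), axis=1)
print(rows.shape, cols.shape)

# 1-D label arrays only have axis 0, so they are joined end to end.
lbls = np.concatenate((np.array([0, 1]), np.array([1, 0])))
print(lbls)
```

For aggregating training folds you almost always want axis=0, since each file contributes additional samples.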

To process your files, I would extract the train_data and train_lbls for each file, then concatenate a list of each, e.g.:

import numpy as np
# t1 and t2 stand in for the (train_data, train_lbls) pairs returned by crossValidFiles
t1 = [np.array([[1,2,3],[4,5,6]]), np.array(['train_lbls'])]
t2 = [np.array([[7,8,9],[10,11,12]]), np.array(['train_lbls'])]
train_set = [t1,t2]
# with real files, build the pairs instead with: [crossValidFiles(f) for f in training_files]
train_data = np.concatenate([a for (a,b) in train_set], axis=0)
train_lbls = np.concatenate([b for (a,b) in train_set], axis=0)
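
Putting it together, the whole 6-fold loop could look roughly like the sketch below. The crossValidFiles here is only a stand-in that fabricates arrays so the example runs without files on disk, and the file names are assumed to follow the 1 to 6 pattern from the question.

```python
import numpy as np

def crossValidFiles(path):
    # Stand-in for the asker's loader (an assumption): returns fake
    # (data, labels) arrays tagged with the fold number in the file name.
    fold = int(path.split("/")[-1].split(".")[0])
    data = np.full((4, 3), fold, dtype=float)
    labels = np.full(4, fold)
    return data, labels

# Assumed naming pattern: datasets/1.pickle ... datasets/6.pickle
sets = [f"datasets/{k}.pickle" for k in range(1, 7)]

folds = []
for i in range(len(sets)):
    # Hold out file i as the test set, train on the other five.
    test_data, test_lbls = crossValidFiles(sets[i])
    train_files = sets[:i] + sets[i + 1:]
    train_pairs = [crossValidFiles(f) for f in train_files]
    train_data = np.concatenate([d for d, _ in train_pairs], axis=0)
    train_lbls = np.concatenate([l for _, l in train_pairs], axis=0)
    folds.append((train_data, train_lbls, test_data, test_lbls))
```

Each iteration yields one train/test split, and in real use you would fit and score your model inside the loop instead of collecting the arrays in folds.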

Upvotes: 1
