Sadman Sakib

Reputation: 595

PyTable table split

I have a PyTables table with the following structure:

/neg/data.cols (Cols), 5 columns
  data (Column(8909, 256, 256), ('<f4', (256, 256)))
  filename (Column(8909,), |S100)
  id (Column(8909,), uint32)
  label (Column(8909,), uint8)
  offset (Column(8909,), float64)

There are 8,909 rows in the table, each holding data and a corresponding label. I want to split this table into train and test sets for machine learning, say keeping 80% of the rows as training data and 20% as test data.

Is there a utility function that can help me do this?

Upvotes: 0

Views: 148

Answers (1)

kcw78

Reputation: 8006

What are your desired data objects after splitting the table into train and test datasets? One NumPy array for each set? Or four arrays: train_data, train_labels, test_data, test_labels? (Or something else?)

If you are using Keras, there are other answers posted on StackOverflow related to loading batches that might be helpful. I am sharing links as a reference. Read my answer below if you want to code it yourself.

For Keras batch loading: Keras: load images batch wise for large dataset and How to split dataset into K-fold without loading the whole dataset at once? If those don't help, Google "keras fit_generator" for tutorials. You will need to write a Python generator function that reads and loads image arrays from your H5 file, as sketched below.
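For reference, here is a minimal sketch of such a generator, assuming the /neg/data layout from your file (the function name h5_batch_generator and its parameters are illustrative, not a Keras or PyTables API):

import numpy as np
import tables as tb

def h5_batch_generator(h5_path, batch_size=32):
    # Yield (data, label) batches from the '/neg/data' table indefinitely,
    # reshuffling the row order at the start of every pass (epoch).
    with tb.File(h5_path, 'r') as h5f:
        table = h5f.root.neg.data
        nrows = table.nrows
        while True:
            idx = np.random.permutation(nrows)
            for start in range(0, nrows, batch_size):
                # sorted coordinates are read sequentially, which is faster
                coords = np.sort(idx[start:start + batch_size])
                batch = table.read_coordinates(coords)
                yield batch['data'], batch['label']

You would pass such a generator to model.fit() (or the legacy fit_generator()), along with steps_per_epoch = nrows // batch_size.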

My answer:
Typically you want to read random rows from the Table. The code below shows how to do that with the Table.read_coordinates() function. It is a standalone example that creates a sample HDF5 file matching your data schema (same group and dataset layout). It closes the file after creating the data, then reopens it read-only to access the data.

The read process first creates two lists of random, unique row numbers (test_list and train_list), then uses them to read the row data into the arrays described above. The read-only part should work "as-is" with your data (just change the filename). Once you have the arrays, you can pass them to train and test your model.

import tables as tb
import numpy as np
import random

nrows = 8000
size = 100
# simplified dtypes (float64/int64 instead of the question's <f4/uint32/uint8);
# the group and table layout are the same
ds_dt = np.dtype( [ ('data',(float,(size,size))), ('filename','S100'),
                    ('id',int), ('label',int), ('offset',float) ] )

# Create sample data for use below
with tb.File('SO_70014301.h5','w') as h5f:
    data_table = h5f.create_table('/neg','data',description=ds_dt,createparents=True)
    for cnt in range(nrows):
        arr = np.random.random(size*size).reshape(size,size)
        # one row: (image array, filename, id, label, offset)
        data_list = [ (arr, f'filename_{cnt+1:03}.jpg', cnt+1, cnt+1001, 10.*cnt), ]
        data_table.append(data_list)

# read sample data and extract Table: '/neg/data' aka file.root.neg.data
with tb.File('SO_70014301.h5','r') as h5f:
    data_table = h5f.root.neg.data
    nrows = data_table.nrows
    # create lists of row ids to extract test and training data
    row_list = list(range(nrows))
    test_list = sorted(random.sample(row_list, k=int(0.20*nrows)))
    # remaining rows form the training set; sort so reads are sequential
    train_list = sorted(set(row_list) - set(test_list))
    print(len(row_list),len(test_list),len(train_list))

    #extract entire training dataset to np.recarray
    train_arr = data_table.read_coordinates(train_list)
    print(train_arr.shape)
    print(train_arr.dtype)

    #extract entire test dataset to np.recarray
    test_arr = data_table.read_coordinates(test_list)
    print(test_arr.shape)
    print(test_arr.dtype)
    
    # extract training data to array
    train_data = data_table.read_coordinates(train_list,field='data')
    # extract training labels to array
    train_labs = data_table.read_coordinates(train_list,field='label')
    print(train_data.shape, train_data.dtype)
    print(train_labs.shape, train_labs.dtype)
    
    # extract test data to array
    test_data = data_table.read_coordinates(test_list,field='data')
    # extract test labels to array
    test_labs = data_table.read_coordinates(test_list,field='label')
    print(test_data.shape, test_data.dtype)
    print(test_labs.shape, test_labs.dtype)

Upvotes: 1
