Rahul Priyadarshan

Reputation: 51

Why does train_test_split take a long time to run?

I'm using Google Colab to train a convolutional neural network. To split my dataset of roughly 11,500 samples, each of shape 63x63x63, I used train_test_split from sklearn.

from sklearn.model_selection import train_test_split

test_split = 0.1
random_state = 42
X_train, X_test, y_train, y_test = train_test_split(triplets, df.label, test_size=test_split, random_state=random_state)

Every time my runtime disconnects, I need to re-run this cell to proceed. However, this command alone takes close to 10 minutes (or more), while every other command in the notebook finishes in a few seconds or less. I'm not really sure what the issue is; I tried switching the runtime to GPU, and my internet connection seems stable. What could the problem be?
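(One workaround for the disconnect problem, sketched here as an assumption rather than something from the original post, is to cache the split indices to disk so a reconnect does not redo the expensive split. The file path is hypothetical, and small dummy arrays stand in for `triplets` and `df.label`:)

```python
# Workaround sketch (not from the original post): cache the train/test
# split indices so a runtime reconnect does not redo the expensive split.
import os
import numpy as np
from sklearn.model_selection import train_test_split

triplets = np.zeros((100, 4, 4, 4))  # stand-in for the real (11500, 63, 63, 63) array
labels = np.arange(100)              # stand-in for df.label
idx_path = "split_indices.npz"       # hypothetical cache file

if os.path.exists(idx_path):
    # Reuse a previously computed split
    cached = np.load(idx_path)
    train_idx, test_idx = cached["train"], cached["test"]
else:
    # First run: split a cheap index array, then cache the result
    train_idx, test_idx = train_test_split(
        np.arange(len(triplets)), test_size=0.1, random_state=42
    )
    np.savez(idx_path, train=train_idx, test=test_idx)

# Materialize the actual splits by fancy indexing
X_train, X_test = triplets[train_idx], triplets[test_idx]
y_train, y_test = labels[train_idx], labels[test_idx]
```

On subsequent runs only the small `.npz` file is loaded, so the split itself costs nothing.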

Upvotes: 5

Views: 4315

Answers (1)

Quwsar Ohi

Reputation: 637

Why does it take so much time?

Your data shape is 11500x63x63x63. It is normal for the split to take this long, because the data is massive.

Explanation: Since the data shape is 11500x63x63x63, the array holds roughly 3x10^9 elements (the exact count is 2,875,540,500). A machine can generally execute 10^7~10^8 instructions per second; since Python is relatively slow, assume Google Colab manages about 10^7 element operations per second. Then,

Minimum time needed for train_test_split = 3x10^9 / 10^7 = 300 seconds = 5 minutes
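(The arithmetic can be checked directly; note the 10^7 operations-per-second rate is an assumption from the paragraph above, not a measured figure:)

```python
# Back-of-the-envelope estimate of the split cost
n_samples = 11_500
elements = n_samples * 63 * 63 * 63     # total scalar elements in the array
print(elements)                          # 2875540500, about 3e9

ops_per_second = 10**7                   # assumed Python throughput on Colab
print(elements / ops_per_second)         # about 288 seconds, i.e. roughly 5 minutes
```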

The time complexity of train_test_split itself is close to O(n), but because of the sheer volume of data copied into and returned from the function, the data-passing and retrieval operations become a bottleneck, roughly doubling the actual running time of your script.

How to solve it?

A simple solution is to pass the indices of the feature dataset instead of the feature dataset itself (here, the feature dataset is triplets). This cuts out the extra time spent copying the returned training and testing features inside train_test_split. The resulting speedup depends on the data type you are using.

To illustrate, here is a short code snippet:

import numpy as np
from sklearn.model_selection import train_test_split

# Build an index array over the input features
X_index = np.arange(0, 11500)

# Pass the index array instead of the big feature matrix
X_train, X_test, y_train, y_test = train_test_split(X_index, df.label, test_size=0.1, random_state=42)

# Extract the feature matrices using the split index arrays
X_train = triplets[X_train]
X_test = triplets[X_test]

In the code above, I pass only the indices of the input features to train_test_split and then extract the train and test feature matrices manually, avoiding the cost of returning a huge matrix from the function.

The expected improvement depends on the data type you are using. To support this, I ran a benchmark over NumPy matrices of various data types on Google Colab. The benchmark code and output are given below; as the results show, in some cases the improvement is modest.

Code:

import timeit
import numpy as np
from sklearn.model_selection import train_test_split

def benchmark(dtypes):
    for dtype in dtypes:
        print('Benchmark for dtype', dtype, end='\n'+'-'*40+'\n')
        X = np.ones((5000, 63, 63, 63), dtype=dtype)
        y = np.ones((5000, 1), dtype=dtype)
        X_index = np.arange(0, 5000)

        start_time = timeit.default_timer()
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
        print(f'Time elapsed: {timeit.default_timer()-start_time:.3f}')

        start_time = timeit.default_timer()
        X_train, X_test, y_train, y_test = train_test_split(X_index, y, test_size=0.1, random_state=42)

        X_train = X[X_train]
        X_test = X[X_test]
        print(f'Time elapsed with indexing: {timeit.default_timer()-start_time:.3f}')
        print()

benchmark([np.int8, np.int16, np.int32, np.int64, np.float16, np.float32, np.float64])

Output:

Benchmark for dtype <class 'numpy.int8'>
----------------------------------------
Time elapsed: 0.473
Time elapsed with indexing: 0.304

Benchmark for dtype <class 'numpy.int16'>
----------------------------------------
Time elapsed: 0.895
Time elapsed with indexing: 0.604

Benchmark for dtype <class 'numpy.int32'>
----------------------------------------
Time elapsed: 1.792
Time elapsed with indexing: 1.182

Benchmark for dtype <class 'numpy.int64'>
----------------------------------------
Time elapsed: 2.493
Time elapsed with indexing: 2.398

Benchmark for dtype <class 'numpy.float16'>
----------------------------------------
Time elapsed: 0.730
Time elapsed with indexing: 0.738

Benchmark for dtype <class 'numpy.float32'>
----------------------------------------
Time elapsed: 1.904
Time elapsed with indexing: 1.400
Benchmark for dtype <class 'numpy.float64'>
----------------------------------------
Time elapsed: 5.166
Time elapsed with indexing: 3.076

Upvotes: 3
