user14421092

Reputation: 155

Is there a simpler way to split a list into sublists randomly without repeating elements in Python?

I would like to split a list into 3 sublists (train, validation, test) using pre-defined ratios. The items should be assigned to the sublists randomly and without repetition. (My original list contains the names of images in a folder, which I want to process after the split.) I found a working method, but it seems complicated. Is there a simpler way to do this?

This is my code:

import random
import os 

# list files in folder
files = os.listdir("C:/.../my_folder")

# define the size of the sets: ~30% validation, ~20% test, ~50% training (remaining goes to training set)
validation_count = int(0.3 * len(files))
test_count = int(0.2 * len(files))
training_count = len(files) - validation_count - test_count

# randomly choose ~20% of all files for the test set
test_set = random.sample(files, k=test_count)

# remove the already chosen files from the original list
files_wo_test_set = [f for f in files if f not in test_set]

# randomly choose the validation set (~30% of all files) from the remaining files
validation_set = random.sample(files_wo_test_set, k=validation_count)

# the remaining files go into the training set
training_set = [f for f in files_wo_test_set if f not in validation_set]

Upvotes: 3

Views: 3219

Answers (3)

Joy

Reputation: 97

I hope this can help someone. scikit-learn has a function, train_test_split, that does this easily:

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split

>>> X = np.arange(15).reshape((5, 3))
>>> X
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])

>>> X_train, X_test = train_test_split(X, test_size=0.3, random_state=42)

>>> X_train
array([[ 6,  7,  8],
       [ 0,  1,  2],
       [ 9, 10, 11]])

>>> X_test
array([[ 3,  4,  5],
       [12, 13, 14]])
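
The example above only produces two sets. For the three sets asked about in the question, one option is to call train_test_split twice. This is only a sketch: the file names are made up, and I'm passing integer counts as test_size (which scikit-learn treats as an absolute number of samples) so that the sizes match the counts computed in the question.

```python
from sklearn.model_selection import train_test_split

# Hypothetical file names standing in for os.listdir(...)
files = [f"img_{i:03d}.png" for i in range(100)]

# Same counts as in the question: ~20% test, ~30% validation, rest training
test_count = int(0.2 * len(files))
validation_count = int(0.3 * len(files))

# First call: carve off the test set
rest, test_set = train_test_split(files, test_size=test_count, random_state=42)

# Second call: carve the validation set out of what is left
training_set, validation_set = train_test_split(
    rest, test_size=validation_count, random_state=42
)

print(len(training_set), len(validation_set), len(test_set))
```

Fixing random_state makes the split repeatable between runs; leave it out if you want a different split each time.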

Upvotes: 0

Shadowcoder

Reputation: 972

I think the answer is self-explanatory, so I am not adding any explanation.

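import random

# files, test_count and validation_count are defined as in the question
random.shuffle(files)  # shuffles the list in place
k = test_count
set1 = files[:k]                      # test set (~20%)
set2 = files[k:k + validation_count]  # validation set (~30%)
set3 = files[k + validation_count:]   # training set (the rest)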
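
If you'd rather not reorder the original list, or you need the same split on every run, the shuffle-and-slice idea can be wrapped in a small function. This is just a sketch: the function name, the seed parameter and the file names are my own, not from the question.

```python
import random

def shuffle_and_slice(files, test_count, validation_count, seed=None):
    """Return (test, validation, training) lists without modifying `files`."""
    rng = random.Random(seed)  # private generator; a fixed seed repeats the split
    # sample() with k=len(files) returns a shuffled copy with no repeated elements
    shuffled = rng.sample(files, k=len(files))
    test = shuffled[:test_count]
    validation = shuffled[test_count:test_count + validation_count]
    training = shuffled[test_count + validation_count:]
    return test, validation, training

# Hypothetical file names standing in for os.listdir(...)
files = [f"img_{i:03d}.png" for i in range(10)]
test_set, validation_set, training_set = shuffle_and_slice(files, 2, 3, seed=0)
print(len(test_set), len(validation_set), len(training_set))  # 2 3 5
```

Because every element of the shuffled copy lands in exactly one slice, the three lists cannot overlap.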

Upvotes: 4

Alan

Reputation: 2518

I'd recommend looking into the scikit-learn library, as it contains the train_test_split function, which does this for you. However, to answer your question using just the random library:

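import os
import random

# First shuffle the list randomly (in place)
files = os.listdir("C:/.../my_folder")
random.shuffle(files)

# Then just slice
test_count = int(len(files) * 0.2)  # 20%
val_count = int(len(files) * 0.3)   # 30%
test_set = files[:test_count]
val_set = files[test_count:test_count + val_count]
train_set = files[test_count + val_count:]  # remaining ~50%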
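
The same slicing generalizes to any number of splits by turning the ratios into cumulative slice boundaries. A rough sketch, assuming the ratios sum to at most 1; the function name and the file names are invented for illustration. As in the question, each count is truncated with int() and whatever is left over goes to the last list.

```python
import random
from itertools import accumulate

def split_by_ratios(items, ratios, seed=None):
    """Split items into len(ratios) + 1 disjoint lists; the last one gets the remainder.

    Assumes sum(ratios) <= 1.
    """
    shuffled = random.Random(seed).sample(items, k=len(items))  # shuffled copy
    counts = [int(r * len(items)) for r in ratios]              # truncated sizes
    bounds = [0, *accumulate(counts), len(items)]               # slice boundaries
    return [shuffled[a:b] for a, b in zip(bounds, bounds[1:])]

# Hypothetical file names standing in for os.listdir(...)
files = [f"img_{i:03d}.png" for i in range(23)]
test_set, val_set, train_set = split_by_ratios(files, [0.2, 0.3], seed=1)
print(len(test_set), len(val_set), len(train_set))  # 4 6 13
```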

Upvotes: 1
