Reputation: 833
I have a very large folder of images, as well as a CSV file containing the class labels for each of those images. Because it's all in one giant folder, I'd like to split them up into training/test/validation sets; maybe create three new folders and move images into each based on a Python script of some kind. I'd like to do stratified sampling so I can keep the % of classes the same across all three sets.
How would I go about writing a script that does this?
Upvotes: 24
Views: 98112
Reputation: 11
There is an easy way to split folders of images into train/val/test sets using the split-folders library:
import splitfolders
input_folder = 'path/'
# Split with a ratio.
# To split into only a training and a validation set, pass a two-element tuple to `ratio`, e.g. (.8, .2).
# Order of the ratios: train, val, test
splitfolders.ratio(input_folder,
                   output="cell_images2",
                   seed=42,
                   ratio=(.7, .2, .1),
                   group_prefix=None)
Upvotes: 1
Reputation: 429
I used this easy method to split folders of images into train/test/val using the split-folders library. If you are new to this library, you have to install it first using the pip command: pip install split-folders
import splitfolders
image_directory = r'C:\Users\ugoch\Desktop\16 OBU students_Model\CNN from scratch\data_mv'
splitfolders.ratio(image_directory, output="output",
                   seed=42, ratio=(0.7, 0.15, 0.15),
                   group_prefix=None, move=False)  # group_prefix and move are left at their defaults
Copying files: 48 files [00:00, 843.49 files/s]
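Since the question asks for the class percentages to stay the same across the sets, it can help to check the result after splitting. Below is a minimal sketch, assuming the output layout is output/<split>/<class>/ as produced by the snippets in this thread. The function name class_shares is made up for illustration and is not part of split-folders.

```python
import os
from collections import Counter


def class_shares(output_dir):
    """Return {split: {class: share of that split's files}} for a
    directory laid out as output_dir/<split>/<class>/<files>."""
    shares = {}
    for split in sorted(os.listdir(output_dir)):
        split_path = os.path.join(output_dir, split)
        if not os.path.isdir(split_path):
            continue  # ignore stray files next to the split folders
        counts = Counter()
        for cls in os.listdir(split_path):
            cls_path = os.path.join(split_path, cls)
            if os.path.isdir(cls_path):
                counts[cls] = len(os.listdir(cls_path))
        total = sum(counts.values())
        # An empty split gets an empty dict instead of a division by zero.
        shares[split] = {c: n / total for c, n in counts.items()} if total else {}
    return shares
```

Calling class_shares("output") and comparing the per-class shares of train, val and test should show roughly equal numbers if the split is stratified. Small classes will deviate a little because file counts are rounded.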
Upvotes: 1
Reputation: 629
I have developed a Python package called python_splitter to automate the whole process in one line. It auto-generates Train-Test-Val or Train-Test folders. Read more: https://github.com/bharatadk/python_splitter
! pip install python_splitter
import python_splitter
python_splitter.split_from_folder("SOURCE_FOLDER", train=0.5, test=0.3, val=0.2)
**I have also written an improved script, which you only have to run once 😍**
## I made this for TB vs Normal image datasets by improving the code above
## import libraries
import os
import numpy as np
import shutil

# creating train / val / test folders
root_dir = 'TB_Chest_Radiography_Database/'
new_root = 'AllDatasets/'
classes = ['Normal', 'Tuberculosis']

# os.makedirs raises if a folder already exists, so a second run fails
# here instead of mixing two different shuffles into the same folders
for cls in classes:
    os.makedirs(root_dir + new_root + 'train/' + cls)
    os.makedirs(root_dir + new_root + 'val/' + cls)
    os.makedirs(root_dir + new_root + 'test/' + cls)

## creating partitions of the data after shuffling
for cls in classes:
    src = root_dir + cls  # folder to copy images from
    print(src)

    allFileNames = os.listdir(src)
    np.random.shuffle(allFileNames)

    ## here 0.75 = training ratio, (0.95 - 0.75) = validation ratio,
    ## (1 - 0.95) = test ratio
    train_FileNames, val_FileNames, test_FileNames = np.split(
        np.array(allFileNames),
        [int(len(allFileNames) * 0.75), int(len(allFileNames) * 0.95)])

    # prepend the source folder to each file name (array -> list of paths)
    train_FileNames = [src + '/' + name for name in train_FileNames]
    val_FileNames = [src + '/' + name for name in val_FileNames]
    test_FileNames = [src + '/' + name for name in test_FileNames]

    print('Total images : ' + cls + ' ' + str(len(allFileNames)))
    print('Training : ' + cls + ' ' + str(len(train_FileNames)))
    print('Validation : ' + cls + ' ' + str(len(val_FileNames)))
    print('Testing : ' + cls + ' ' + str(len(test_FileNames)))

    ## copying images to the target directories
    for name in train_FileNames:
        shutil.copy(name, root_dir + new_root + 'train/' + cls)
    for name in val_FileNames:
        shutil.copy(name, root_dir + new_root + 'val/' + cls)
    for name in test_FileNames:
        shutil.copy(name, root_dir + new_root + 'test/' + cls)
Upvotes: 6
Reputation: 835
Use the Python library split-folders.
pip install split-folders
Let all the images be stored in the Data folder.
Then apply as follows:
import splitfolders
splitfolders.ratio('Data', output="output", seed=1337, ratio=(.8, 0.1, 0.1))
On running the above code snippet, it will create three folders in the output directory: train, val, and test.
The number of images in each folder can be varied using the values in the ratio argument (train:val:test).
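Note that split-folders expects the input folder to contain one subfolder per class, while the question describes a single flat folder plus a CSV of labels. One way to bridge that gap is to first copy the images into class subfolders. The following is only a sketch under assumptions: the CSV is assumed to have a header row, the column names "filename" and "label" are placeholders for whatever the real file uses, and sort_into_class_folders is a made-up helper name.

```python
import csv
import os
import shutil


def sort_into_class_folders(image_dir, csv_path, out_dir,
                            name_col="filename", label_col="label"):
    """Copy every image listed in the CSV into out_dir/<label>/.

    name_col and label_col are assumed column names; adjust them to
    match the header of the actual CSV. Returns the number of files copied.
    """
    copied = 0
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            src = os.path.join(image_dir, row[name_col])
            if not os.path.isfile(src):
                continue  # skip rows whose image is missing on disk
            dst_dir = os.path.join(out_dir, row[label_col])
            os.makedirs(dst_dir, exist_ok=True)
            shutil.copy(src, dst_dir)
            copied += 1
    return copied
```

After this step, out_dir can be passed to splitfolders.ratio in place of 'Data'. Copying rather than moving keeps the original folder untouched, at the cost of extra disk space.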
Upvotes: 72
Reputation: 96
Taking Steven White's answer above and altering it a bit as there was a minor issue with the splitting. Also, the files were being saved in the main folder instead of train/test/val folders respectively.
import os
import numpy as np
import shutil
# import pandas as pd  # only needed for the optional CSV lines below


def train_test_split():
    print("########### Train Test Val Script started ###########")
    # data_csv = pd.read_csv("DataSet_Final.csv")  # Use if you have classes saved in a .csv file

    root_dir = 'New_folder_to_be_created'
    classes_dir = ['class 1', 'class 2', 'class 3', 'class 4']

    # for name in data_csv['names'].unique()[:10]:
    #     classes_dir.append(name)

    processed_dir = 'Existing_folder_to_take_images_from'

    val_ratio = 0.20
    test_ratio = 0.20

    for cls in classes_dir:
        # Creating partitions of the data after shuffling
        print("$$$$$$$ Class Name " + cls + " $$$$$$$")
        src = os.path.join(processed_dir, cls)  # Folder to copy images from

        allFileNames = os.listdir(src)
        np.random.shuffle(allFileNames)
        # Split points: train ends at 1 - (val + test), val ends at 1 - test,
        # so the middle chunk has val_ratio of the files and the last has test_ratio
        train_FileNames, val_FileNames, test_FileNames = np.split(
            np.array(allFileNames),
            [int(len(allFileNames) * (1 - (val_ratio + test_ratio))),
             int(len(allFileNames) * (1 - test_ratio))])

        train_FileNames = [os.path.join(src, name) for name in train_FileNames.tolist()]
        val_FileNames = [os.path.join(src, name) for name in val_FileNames.tolist()]
        test_FileNames = [os.path.join(src, name) for name in test_FileNames.tolist()]

        print('Total images: ' + str(len(allFileNames)))
        print('Training: ' + str(len(train_FileNames)))
        print('Validation: ' + str(len(val_FileNames)))
        print('Testing: ' + str(len(test_FileNames)))

        # Creating Train / Val / Test folders (one-time use: raises if they already exist)
        os.makedirs(os.path.join(root_dir, 'train', cls))
        os.makedirs(os.path.join(root_dir, 'val', cls))
        os.makedirs(os.path.join(root_dir, 'test', cls))

        # Copy-pasting images
        for name in train_FileNames:
            shutil.copy(name, os.path.join(root_dir, 'train', cls))
        for name in val_FileNames:
            shutil.copy(name, os.path.join(root_dir, 'val', cls))
        for name in test_FileNames:
            shutil.copy(name, os.path.join(root_dir, 'test', cls))

    print("########### Train Test Val Script Ended ###########")


train_test_split()
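If the class labels live in a CSV rather than in per-class folders, the same per-class shuffle-and-slice idea can be applied to a filename-to-label mapping before any files are copied. This is a rough sketch, not part of the script above: stratified_split is a hypothetical helper, and it assumes the CSV has already been loaded into a plain dict.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_ratio=0.2, test_ratio=0.2, seed=42):
    """Split filenames into train/val/test, class by class.

    labels: dict mapping filename -> class label (e.g. built from a CSV).
    Returns a dict with keys "train", "val", "test" holding filename lists.
    """
    by_class = defaultdict(list)
    for name, cls in labels.items():
        by_class[cls].append(name)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    splits = {"train": [], "val": [], "test": []}
    for cls in sorted(by_class):
        names = sorted(by_class[cls])  # sort first so the seed alone decides the order
        rng.shuffle(names)
        n_val = int(len(names) * val_ratio)
        n_test = int(len(names) * test_ratio)
        n_train = len(names) - n_val - n_test  # rounding leftovers go to train
        splits["train"] += names[:n_train]
        splits["val"] += names[n_train:n_train + n_val]
        splits["test"] += names[n_train + n_val:]
    return splits
```

Because each class is sliced with the same ratios, every set keeps approximately the original class proportions. The returned lists can then be fed to shutil.copy in the same way as the loops above.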
Upvotes: 7
Reputation: 31
I had a similar task. My images and the corresponding annotations in XML format were stored in one folder. I made train and test folders, and after moving files into them I used the original folder as the validation folder (see the script).
Here is my script to split files into test/training/validation sets:
import os
from random import choice
import shutil

# lists to store file names
imgs = []
xmls = []

# set up dir names
trainPath = 'train'
valPath = 'val'
testPath = 'test'
crsPath = 'img'  # dir where images and annotations are stored

# set up ratios (val ratio = rest of the files left in the origin dir after splitting into train and test)
train_ratio = 0.8
test_ratio = 0.1

# total count of images (each image has one annotation file)
totalImgCount = len(os.listdir(crsPath)) // 2

# sort files into the corresponding lists
for (dirname, dirs, files) in os.walk(crsPath):
    for filename in files:
        if filename.endswith('.xml'):
            xmls.append(filename)
        else:
            imgs.append(filename)

# number of images to move in each loop
countForTrain = int(len(imgs) * train_ratio)
countForTest = int(len(imgs) * test_ratio)

# make sure the target dirs exist before moving files into them
os.makedirs(trainPath, exist_ok=True)
os.makedirs(testPath, exist_ok=True)

# loop for train dir
for x in range(countForTrain):
    fileJpg = choice(imgs)  # get name of a random image from the origin dir
    fileXml = os.path.splitext(fileJpg)[0] + '.xml'  # name of the corresponding annotation file

    # move both files into train dir
    shutil.move(os.path.join(crsPath, fileJpg), os.path.join(trainPath, fileJpg))
    shutil.move(os.path.join(crsPath, fileXml), os.path.join(trainPath, fileXml))

    # remove files from the lists
    imgs.remove(fileJpg)
    xmls.remove(fileXml)

# loop for test dir
for x in range(countForTest):
    fileJpg = choice(imgs)  # get name of a random image from the origin dir
    fileXml = os.path.splitext(fileJpg)[0] + '.xml'  # name of the corresponding annotation file

    # move both files into test dir
    shutil.move(os.path.join(crsPath, fileJpg), os.path.join(testPath, fileJpg))
    shutil.move(os.path.join(crsPath, fileXml), os.path.join(testPath, fileXml))

    # remove files from the lists
    imgs.remove(fileJpg)
    xmls.remove(fileXml)

# the rest of the files are validation files, so rename the origin dir to the val dir
os.rename(crsPath, valPath)

# summary information after splitting
print('Total images: ', totalImgCount)
print('Images in train dir:', len(os.listdir(trainPath)) // 2)
print('Images in test dir:', len(os.listdir(testPath)) // 2)
print('Images in validation dir:', len(os.listdir(valPath)) // 2)
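The pairing of each image with its annotation can also be planned up front, before any file is moved, which makes it easy to inspect the split first. Here is a small sketch of that idea, assuming each annotation shares its image's base name and ends in .xml. The helper name split_pairs and its default ratios are illustrative only.

```python
import os
import random


def split_pairs(filenames, train_ratio=0.8, test_ratio=0.1, seed=0):
    """Group (image, annotation) pairs into train/test/val without touching disk.

    filenames: names from one folder holding images and their .xml files.
    Images without a matching .xml file are left out of the plan.
    """
    stems_with_xml = {os.path.splitext(f)[0] for f in filenames
                      if f.lower().endswith(".xml")}
    images = sorted(f for f in filenames
                    if not f.lower().endswith(".xml")
                    and os.path.splitext(f)[0] in stems_with_xml)

    random.Random(seed).shuffle(images)  # one shuffle replaces repeated random picks
    n_train = int(len(images) * train_ratio)
    n_test = int(len(images) * test_ratio)
    groups = {
        "train": images[:n_train],
        "test": images[n_train:n_train + n_test],
        "val": images[n_train + n_test:],  # whatever is left over
    }
    return {split: [(img, os.path.splitext(img)[0] + ".xml") for img in imgs]
            for split, imgs in groups.items()}
```

A typical call would be split_pairs(os.listdir('img')), followed by a loop that moves both files of every pair into the folder named by the key. Using os.path.splitext means extensions of any length, such as .jpeg, are handled.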
Upvotes: 3
Reputation: 160
I ran into a similar problem myself. All my images were stored in two folders, "Project/Data2/DPN+" and "Project/Data2/DPN-". It was a binary classification problem, and the two classes were "DPN+" and "DPN-". Both of these class folders contained .png files. My objective was to distribute the dataset into training, validation and testing folders. Each of these new folders would have two more folders inside it, "DPN+" and "DPN-", indicating the class. For the partition, I used a 70:15:15 distribution. I am a beginner in Python, so please let me know if I made any mistakes.
Following is my code:
import os
import numpy as np
import shutil

root_dir = 'Data2'
posCls = '/DPN+'
negCls = '/DPN-'

# Creating Train / Val / Test folders (one-time use: os.makedirs raises if they already exist)
for split in ['/train', '/val', '/test']:
    for currentCls in [posCls, negCls]:
        os.makedirs(root_dir + split + currentCls)

# Creating partitions of the data after shuffling, one class at a time
for currentCls in [posCls, negCls]:
    src = root_dir + currentCls  # Folder to copy images from

    allFileNames = os.listdir(src)
    np.random.shuffle(allFileNames)
    # Split points at 70% and 85% give a 70:15:15 partition
    train_FileNames, val_FileNames, test_FileNames = np.split(
        np.array(allFileNames),
        [int(len(allFileNames) * 0.7), int(len(allFileNames) * 0.85)])

    train_FileNames = [src + '/' + name for name in train_FileNames.tolist()]
    val_FileNames = [src + '/' + name for name in val_FileNames.tolist()]
    test_FileNames = [src + '/' + name for name in test_FileNames.tolist()]

    print('Class: ', currentCls)
    print('Total images: ', len(allFileNames))
    print('Training: ', len(train_FileNames))
    print('Validation: ', len(val_FileNames))
    print('Testing: ', len(test_FileNames))

    # Copy-pasting images
    for name in train_FileNames:
        shutil.copy(name, root_dir + '/train' + currentCls)
    for name in val_FileNames:
        shutil.copy(name, root_dir + '/val' + currentCls)
    for name in test_FileNames:
        shutil.copy(name, root_dir + '/test' + currentCls)
Upvotes: 15