Reputation: 11192
I have a cat and dog image dataset, which I split into two folders (cat and dog), each containing roughly 10000 images. I don't need all 10000 images; I need only 2000 in each folder. How can I automate this in Python?
I know that I can delete a file X with os.remove(X),
and similarly delete a folder with os.rmdir(dir_).
But I'm wondering how I could efficiently delete n random files from each folder.
So far I have tried:
import os
import numpy as np
dogs_dir = os.listdir('dogs')
cats_dir = os.listdir('cats')
# replace=False so that no filename is picked twice
selected_dogs = np.random.choice(dogs_dir, 8000, replace=False)
selected_cats = np.random.choice(cats_dir, 8000, replace=False)
for file_ in selected_dogs:
    os.remove('dogs/' + file_)
for file_ in selected_cats:
    os.remove('cats/' + file_)
The above code does the job for me, but I'm wondering whether there is a more efficient way, so that I could reduce the complexity of my code.
Any help would be appreciated.
I'm using Ubuntu 17.10. For now a Linux-based solution is sufficient, but if it is also compatible with Windows, that would be even better.
Upvotes: 1
Views: 1884
Reputation:
Use random.sample() and the pathlib module:
from pathlib import Path
import random
def delete_images(directory, number_of_images, extension='jpg'):
    # glob() returns a generator; random.sample() needs a sequence, hence list()
    images = list(Path(directory).glob(f'*.{extension}'))
    for image in random.sample(images, number_of_images):
        image.unlink()
delete_images('dogs', 8000)
delete_images('cats', 8000)
Path('cats/').glob('*.jpg') yields Path objects (as a generator, not a list) that represent the files in the cats directory whose filenames end with .jpg.
random.sample(<something>, 8000) takes a random sample of 8000 distinct items from a sequence such as a list.
Path().unlink() deletes a file.
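To illustrate why sampling without replacement matters when the sample is a list of files to delete, here is a minimal sketch using only the standard library. The filenames are made up and stand in for a directory listing; random.choices() draws with replacement (as np.random.choice does by default), while random.sample() never draws the same item twice.

```python
import random

# Hypothetical stand-in for a directory listing: 10000 made-up filenames.
filenames = [f'dog.{i}.jpg' for i in range(10000)]

rng = random.Random(0)  # seeded only so the sketch is repeatable

# With replacement: the same filename can be drawn more than once,
# so a second attempt to delete it would fail.
with_replacement = rng.choices(filenames, k=8000)

# Without replacement: every drawn filename is distinct.
without_replacement = rng.sample(filenames, 8000)

print(len(set(with_replacement)))     # fewer than 8000 distinct names
print(len(set(without_replacement)))  # 8000
```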
Upvotes: 2
Reputation: 1256
Your code seems okay to me.
A few adjustments I would make:
It's better to build paths with the os library so that the code is cross-platform. When you write os.remove('dogs/'+file_), the / separator is hard-coded and not guaranteed to be portable. It would be better to use os.remove(os.path.join('dogs', file_)).
You're using a lot of memory to hold the lists of filenames to delete (two lists of up to 10000 strings). If it doesn't matter to you which images are kept, you could save a little space (about 20%) by slicing:
import os
# Everything after the first 2000 entries (os.listdir() order is arbitrary)
dogs_delete = os.listdir('dogs')[2000:]
for file_ in dogs_delete:
    os.remove(os.path.join('dogs', file_))
If it does matter which images are kept and you want the choice to be random, it's better to sample indices instead (less space):
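import os
import random
dogs_dir = os.listdir('dogs')
# random.sample() needs a sequence, so sample from range(len(...)), not from an int
for num in random.sample(range(len(dogs_dir)), 8000):
    os.remove(os.path.join('dogs', dogs_dir[num]))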
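Since the folders hold only roughly 10000 images, it may be more convenient to state how many files to keep than how many to delete. The following is a rough sketch of that idea using the same index-sampling approach; the function name trim_directory and its keep and seed parameters are made up for illustration.

```python
import os
import random


def trim_directory(directory, keep, seed=None):
    """Randomly delete files in `directory` until at most `keep` remain.

    Returns the number of files deleted.
    """
    # Only consider regular files, skipping any subdirectories.
    names = [name for name in os.listdir(directory)
             if os.path.isfile(os.path.join(directory, name))]
    # Never try to delete a negative number of files.
    n_delete = max(len(names) - keep, 0)
    rng = random.Random(seed)
    # Sample indices without replacement so no file is chosen twice.
    for index in rng.sample(range(len(names)), n_delete):
        os.remove(os.path.join(directory, names[index]))
    return n_delete
```

Under these assumptions, trim_directory('dogs', keep=2000) and trim_directory('cats', keep=2000) would leave at most 2000 files in each folder, whatever the exact starting count. Because paths are built with os.path.join, the sketch should behave the same on Linux and Windows.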
Upvotes: 3