Mohamed Thasin ah

Reputation: 11192

How to delete n files in a directory using Python

I have a cat and dog image dataset, split into two folders (cat and dog) that each contain roughly 10000 images. I don't need all 10000 images; I only need 2000 in each folder. How can I automate this in Python?

I know that I can delete a file X with os.remove(X), and similarly delete a folder with os.rmdir(dir_).

But I'm wondering how I can efficiently delete n randomly chosen files in each folder.

So far I have tried:

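import os
import numpy as np

dogs_dir = os.listdir('dogs')
cats_dir = os.listdir('cats')

# replace=False so no filename is picked twice (a repeat would make os.remove fail)
selected_dogs = np.random.choice(dogs_dir, 8000, replace=False)
selected_cats = np.random.choice(cats_dir, 8000, replace=False)

for file_ in selected_dogs:
    os.remove('dogs/' + file_)

for file_ in selected_cats:
    os.remove('cats/' + file_)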

The above code does the job for me, but I'm wondering whether there is a more efficient way, so that I can reduce the complexity of my code.

Any help would be appreciated.

I'm using Ubuntu 17.10. For now a Linux-based solution is sufficient, but if it also works on Windows, even better.

Upvotes: 1

Views: 1884

Answers (2)

user3064538

Reputation:

Use random.sample() and the pathlib module:

from pathlib import Path
import random

def delete_images(directory, number_of_images, extension='jpg'):
    # glob() returns a generator; random.sample() needs a sequence, so build a list
    images = list(Path(directory).glob(f'*.{extension}'))
    for image in random.sample(images, number_of_images):
        image.unlink()

delete_images('dogs', 8000)
delete_images('cats', 8000)    

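Path('cats/').glob('*.jpg') returns a generator that yields Path objects for the files in the cats directory whose filenames end with .jpg. random.sample() needs a sequence, so the generator has to be converted with list() first.

random.sample(<something>, 8000) takes a random sample of 8000 distinct items from a sequence such as a list, and raises ValueError if the sequence has fewer than 8000 items.

Path.unlink() deletes a file.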
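Since the question is really about keeping 2000 images rather than deleting exactly 8000, the same idea can be phrased in terms of how many files should remain. The following is only a minimal sketch under that assumption; the function name keep_random_images and its keep and seed parameters are hypothetical and not part of the answer above.

```python
import random
from pathlib import Path


def keep_random_images(directory, keep, extension="jpg", seed=None):
    """Delete random matching files so that at most `keep` of them remain.

    Hypothetical helper for illustration; returns the list of deleted paths.
    """
    # glob() is lazy, so materialise it; sorting makes a seeded run repeatable.
    images = sorted(Path(directory).glob(f"*.{extension}"))
    # Never ask for a larger sample than the number of surplus files.
    surplus = max(len(images) - keep, 0)
    rng = random.Random(seed)
    to_delete = rng.sample(images, surplus)  # sampling without replacement
    for image in to_delete:
        image.unlink()
    return to_delete
```

Calling keep_random_images('dogs', 2000) would then leave at most 2000 .jpg files in dogs, however many there were to begin with, and does nothing if the folder already holds 2000 or fewer.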

Upvotes: 2

Zionsof

Reputation: 1256

Your code seems okay to me.

A few adjustments I would make:

  1. It's better to build paths with the os library so the code stays cross-platform. When you write os.remove('dogs/'+file_), you hard-code / as the path separator, which is not portable in general. It would be better to use os.remove(os.path.join('dogs', file_)).

  2. You're using a lot of space to hold the filenames to delete (two lists of 10000 strings). If it doesn't matter to you which images are kept, you could save a little bit of space (about 20%) by slicing the listing instead. Note that os.listdir() returns names in arbitrary order, so this keeps an arbitrary subset rather than a random one:

    dogs_delete=os.listdir('dogs')[2000:]  # Take the last 8000 images
    for file_ in dogs_delete:
        os.remove(os.path.join('dogs', file_))
    

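    If you do want the deleted images to be chosen at random, it is better to sample indices, which avoids building a second list of filename strings:

    import random

    dogs_dir = os.listdir('dogs')
    # random.sample() needs a sequence, so sample from range(len(...)), not from an int
    for num in random.sample(range(len(dogs_dir)), 8000):
        os.remove(os.path.join('dogs', dogs_dir[num]))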
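Because deletion cannot be undone, it may help to wrap the two adjustments above in a function that can first report what it would delete. This is only a rough sketch using the os module; the name delete_random_files and the dry_run flag are assumptions made for illustration, not something from the question or the answers.

```python
import os
import random


def delete_random_files(directory, count, dry_run=True):
    """Pick `count` random regular files in `directory`.

    Hypothetical helper: with dry_run=True nothing is removed and the
    chosen names are only returned for inspection.
    """
    # Keep only regular files so subdirectories are never passed to os.remove().
    names = [
        name for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    ]
    if count > len(names):
        raise ValueError(
            f"asked to delete {count} files but only {len(names)} exist"
        )
    chosen = random.sample(names, count)  # distinct names, no repeats
    if not dry_run:
        for name in chosen:
            # os.path.join picks the right separator on Linux and Windows.
            os.remove(os.path.join(directory, name))
    return chosen
```

A cautious workflow would be to call delete_random_files('dogs', 8000) once to inspect the returned names, then call it again with dry_run=False. The second call draws a fresh random sample, so the names will differ from the preview.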
    

Upvotes: 3
