swiss_knight

Reputation: 7831

Sort images based on a cluster correspondence list

I have the following working code to sort images according to a cluster list, which is a list of tuples: (image_id, cluster_id).
Each image belongs to exactly one cluster (the same image never appears in two clusters).

I wonder if there is a way to shorten the "for+for+if+if" loops at the end of the code: as it stands, for each file name I must check every pair in the cluster list, which is somewhat redundant.

    import os
    import re
    import shutil

    srcdir  = '/home/username/pictures/' # 
    if not os.path.isdir(srcdir):
        print("Error, %s is not a valid directory!" % srcdir)
        return None

    pts_cls # is the list of pairs (image_id, cluster_id)

    filelist    = [(srcdir+fn) for fn in os.listdir(srcdir) if  
                  re.search(r'\.jpg$', fn, re.IGNORECASE)]
    filelist.sort(key=lambda var:[int(x) if x.isdigit() else  
                  x for x in re.findall(r'[^0-9]|[0-9]+', var)])

    for f in filelist:
        fbname  = os.path.splitext(os.path.basename(f))[0]

        for e,cls in enumerate(pts_cls): # for each (img_id, clst_id) pair
            if str(cls[0])==fbname: # check if image_id corresponds to file basename on disk)
                if cls[1]==-1: # if cluster_id is -1 (->noise)
                    outdir = srcdir+'cluster_'+'Noise'+'/'
                else:
                    outdir = srcdir+'cluster_'+str(cls[1])+'/' 

                if not os.path.isdir(outdir):
                    os.makedirs(outdir)

                dstf = outdir+os.path.basename(f)
                if os.path.isfile(dstf)==False:
                    shutil.copy2(f,dstf) 

Of course, as I am pretty new to Python, any other well-explained improvements are welcome!

Upvotes: 0

Views: 60

Answers (1)

zwer

Reputation: 25789

I think you're complicating this far more than needed. Since your image names are unique (there can only be one image_id), you can safely convert pts_cls into a dict and get fast lookups on the spot instead of looping through the list of pairs every time. You are also using regex where it's not needed, and you're packing your paths only to unpack them later.

Also, your code would break if it happens that an image from your source directory is not in the pts_cls as its outdir would never be set (or worse, its outdir would be the one from the previous loop).
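To make the lookup point concrete, here is a minimal sketch comparing a scan over the pairs with a dict lookup. The sample pts_cls values and the helper name find_cluster_scan are made up for illustration; they are not from your code.

```python
# Hypothetical sample data: (image_id, cluster_id) pairs.
pts_cls = [("img1", 0), ("img2", -1), ("img3", 0)]


def find_cluster_scan(pairs, image_id):
    """Linear scan, like the nested loops: walks the list on every lookup."""
    for img, cls in pairs:
        if img == image_id:
            return cls
    return None  # image_id is not in the list


# Build the dict once; each later lookup is a single hash access.
cluster_map = dict(pts_cls)

print(find_cluster_scan(pts_cls, "img2"))
print(cluster_map.get("img2"))
print(cluster_map.get("missing"))  # .get() returns None instead of raising
```

Both approaches give the same answer; the dict just avoids re-walking the list for every file.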

I'd streamline it like:

    import os
    import shutil

    src_dir = "/home/username/pictures/"

    if not os.path.isdir(src_dir):
        print("Error, %s is not a valid directory!" % src_dir)
        exit(1)  # return is expected only from functions

    pts_cls = []  # is the list of pairs (image_id, cluster_id), load from wherever...

    # convert your pts_cls into a dict - since there cannot be any images in multiple clusters
    # base image name is perfectly ok to use as a key for blazingly fast lookups later
    cluster_map = dict(pts_cls)

    # get only `.jpg` files; store base name and file name, no need for a full path at this time
    files = [(fn[:-4], fn) for fn in os.listdir(src_dir) if fn.lower()[-4:] == ".jpg"]
    # no need for sorting based on your code

    for name, file_name in files:  # loop through all files
        if name in cluster_map:  # proceed with the file only if in pts_cls
            cls = cluster_map[name]  # get our cluster value
            # get our `cluster_<cluster_id>` or `cluster_Noise` (if cluster == -1) target path
            target_dir = os.path.join(src_dir, "cluster_" + str(cls if cls != -1 else "Noise"))
            target_file = os.path.join(target_dir, file_name)  # get the final target path
            if not os.path.exists(target_file):  # if the target file doesn't exist
                if not os.path.isdir(target_dir):  # make sure our target path exists
                    os.makedirs(target_dir, exist_ok=True)  # create a full path if it doesn't
                shutil.copy(os.path.join(src_dir, file_name), target_file)  # copy

UPDATE - If you have multiple 'special' folders for certain cluster IDs (like Noise is for -1) you can create a map like cluster_targets = {-1: "Noise"} where the keys are your cluster IDs and their values are, obviously, the special names. Then you can replace the target_dir generation with: target_dir = os.path.join(src_dir, "cluster_" + str(cluster_targets.get(cls,cls)))
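As a small sketch of that replacement, wrapped in a function so it can be tried on its own. The function name target_dir_for and the example path are only for illustration; cluster_targets is the map described above.

```python
import os

# Special cluster IDs and the folder labels they should get.
cluster_targets = {-1: "Noise"}


def target_dir_for(src_dir, cls):
    """Return the `cluster_<label>` folder path for a cluster ID."""
    # Use the special name if one is defined, else fall back to the ID itself.
    label = cluster_targets.get(cls, cls)
    return os.path.join(src_dir, "cluster_" + str(label))


print(target_dir_for("/home/username/pictures/", -1))
print(target_dir_for("/home/username/pictures/", 3))
```

Adding another special folder is then just one more entry in cluster_targets, with no extra if/else branches.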

UPDATE #2 - Since your image_id values appear to be integers while file names are strings, I'd suggest building your cluster_map dict with the image_id parts converted to strings. That way you'd be comparing like with like, without the danger of a type mismatch:

cluster_map = {str(k): v for k, v in pts_cls}

If you're sure that none of the *.jpg files in your src_dir will have a non-integer in their name you can instead convert the filename into an integer to begin with in the files list generation - just replace fn[:-4] with int(fn[:-4]). But I wouldn't advise that as, again, you never know how your files might be named.
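Putting both updates together, one possible shape for the whole thing as a function is sketched below. The name sort_into_clusters, its parameters and its return value are my own choices for illustration, not something your code needs to match; it also assumes the image_id values stringify to exactly the file base names.

```python
import os
import shutil


def sort_into_clusters(src_dir, pts_cls, cluster_targets=None):
    """Copy each listed .jpg in src_dir into src_dir/cluster_<label>/.

    Returns the destination paths that were actually copied.
    """
    if cluster_targets is None:
        cluster_targets = {-1: "Noise"}  # special folder names per cluster ID
    # String keys, so file base names (always str) compare like with like.
    cluster_map = {str(k): v for k, v in pts_cls}
    copied = []
    for file_name in sorted(os.listdir(src_dir)):
        name, ext = os.path.splitext(file_name)
        src_file = os.path.join(src_dir, file_name)
        if ext.lower() != ".jpg" or not os.path.isfile(src_file):
            continue  # not a JPEG file
        if name not in cluster_map:
            continue  # file on disk that is not listed in pts_cls
        cls = cluster_map[name]
        label = cluster_targets.get(cls, cls)
        target_dir = os.path.join(src_dir, "cluster_" + str(label))
        target_file = os.path.join(target_dir, file_name)
        if not os.path.exists(target_file):  # never overwrite an existing copy
            os.makedirs(target_dir, exist_ok=True)
            shutil.copy2(src_file, target_file)
            copied.append(target_file)
    return copied
```

You would call it as sort_into_clusters(src_dir, pts_cls). Files that are not in pts_cls are simply skipped, and running it a second time copies nothing because the targets already exist.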

Upvotes: 1
