Reputation: 65

How to sort poorly formulated pseudo-numerical strings

I have two lists of rendered image names as input to a deep learning training algorithm, and I need to first group them into related groups (there are several files in each group, with differing numbers of samples, but also several groups for one scene, differing in camera angle).

The problem appears when I see the following types of filenames:

Desired sorting:

Scene_1_Camera_000001.exr
Scene_1_Camera_131072.exr
Scene_1_Camera_0_000001.exr
Scene_1_Camera_0_131072.exr

or:

Scene_1_Camera_0_000001.exr
Scene_1_Camera_0_131072.exr
Scene_1_Camera_000001.exr
Scene_1_Camera_131072.exr

but, actual sorting:

Scene_1_Camera_000001.exr
Scene_1_Camera_0_000001.exr
Scene_1_Camera_0_131072.exr
Scene_1_Camera_131072.exr

The trouble is that the sort goes character by character and has no concept that there may be both a Camera and a Camera_0 (I don't have control over these names; the scenes are historical). It therefore compares the _0 of the camera name against the sample number, which breaks my two groups into three.

I am currently using the following code in another place (minus error checking for clarity), and could conceivably use something similar in a custom sorting function, using the prefix as primary sorting key, and the sample number as secondary, but I am afraid that it will be horribly inefficient.

    res = re.search(r"(.*_)([0-9]{4,6})\.([a-zA-Z]{3})", beauty_file)
    prefix = res.group(1)
    #sample = res.group(2)
    #suffix = res.group(3)

Is there some way to use a custom sorting function, and to do so efficiently (there are 32000 5MB files)?

[Edit 1]

It seems that I was not quite clear enough on the desired sort order: it needs to be sorted first on scene/camera, and only then on sample number, i.e. the last six digits are the secondary key. Otherwise I would get all sample numbers together, regardless of scenes and cameras, which wouldn't allow me to gulp up grouped files together.

[Edit 2]

I prefer a solution which works with standard Python, as I may not be able to install packages on the machine on which the script will run. I am developing on Windows, due to the great debugger. I had in mind something similar to the comparison customisation typically available in templated C++ sorting functions.

Upvotes: 0

Answers (3)

Carsten Whimster

Reputation: 65

I am not quite sure how to handle this reply when I am answering my own question, but here goes:

jmcampbell's answer led me to the following:

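import re
from functools import cmp_to_key

# Prefix ending in "_", a 4-6 digit sample number, then a 3-letter extension.
PATTERN = re.compile(r"(.*_)([0-9]{4,6})\.([a-zA-Z]{3})")

def compare(item1, item2):
    res1 = PATTERN.search(item1)
    res2 = PATTERN.search(item2)
    if res1 is None or res2 is None:
        # Fall back to plain string comparison when a name does not match.
        return (item1 > item2) - (item1 < item2)

    # Primary key: scene/camera prefix. Secondary key: sample number,
    # compared as an integer so that 4- and 6-digit samples order correctly.
    key1 = (res1.group(1), int(res1.group(2)))
    key2 = (res2.group(1), int(res2.group(2)))

    # Return -1, 0 or 1, as a comparison function must.
    return (key1 > key2) - (key1 < key2)

Then call the sorting function as follows (in Python 3, sort() no longer accepts cmp, so the comparison function is wrapped with functools.cmp_to_key, which is also available in Python 2.7):

beauty_files.sort(key=cmp_to_key(compare))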

This gives the desired sort order.
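For what it's worth, the same two-level ordering can probably be expressed with a key function instead of a comparison function, so the regex runs once per filename rather than once per comparison. The following is only a sketch under the assumptions stated in the question (a prefix ending in an underscore, a 4-6 digit sample number, a 3-letter extension); NAME_RE, sort_key and the sample list are illustrative names, not part of the original script.

```python
import re
from itertools import groupby

# Assumed filename layout: <prefix ending in "_"><4-6 digit sample>.<3-letter extension>
NAME_RE = re.compile(r"(.*_)([0-9]{4,6})\.([a-zA-Z]{3})$")

def sort_key(name):
    """Return a tuple that sorts by prefix first, then by sample number."""
    match = NAME_RE.match(name)
    if match is None:
        # Names that do not match sort after all matching ones, in string order.
        return (1, name, 0)
    return (0, match.group(1), int(match.group(2)))

# Illustrative input, taken from the filenames in the question.
beauty_files = [
    "Scene_1_Camera_0_131072.exr",
    "Scene_1_Camera_131072.exr",
    "Scene_1_Camera_0_000001.exr",
    "Scene_1_Camera_000001.exr",
]

beauty_files_sorted = sorted(beauty_files, key=sort_key)

# Once sorted, files that share a prefix are adjacent and can be grouped.
groups = {
    prefix: list(names)
    for prefix, names in groupby(beauty_files_sorted, key=lambda n: sort_key(n)[1])
}
```

With the four names above, this yields the first of the two desired orders from the question (the Camera group before the Camera_0 group), and groups ends up with one entry per scene/camera prefix.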

Thanks for the brainstorming everyone!

Upvotes: 1

Software2

Reputation: 2748

32000 entries isn't a lot - the file size doesn't matter, since you're not editing the files themselves.

There are a few options I could think of:

Rename the entries in your list. Not the files themselves, but the list you've collected. Append _0 to entries that are missing it, and you'll be able to sort easily.
Sort by the last 6 characters (or all characters after the last '_'). There is already a recognizable pattern in your filenames, just omit the part that doesn't match that pattern.
Use a sorting algorithm that can handle those criteria. natsort is a popular option.

Upvotes: 1

jmcampbell

Reputation: 388

If you can be sure that the numbers will always be six digits long, you could use:

listOfFilenames.sort(key=(lambda s: s[-10:-4]))

This slices out the six characters immediately before the four-character extension (such as .exr) and sorts by those.
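To make the slice concrete, and to show how it might be extended so that the scene/camera prefix is the primary key (as Edit 1 of the question asks), here is a small sketch. It assumes exactly six sample digits and a four-character extension; slice_key and the sample list are hypothetical names used only for illustration.

```python
def slice_key(name):
    # Assumes the name ends with six sample digits plus a 4-character
    # extension such as ".exr"; everything before that is the prefix.
    return (name[:-10], int(name[-10:-4]))

# The slice on its own picks out the sample number as a string.
sample_digits = "Scene_1_Camera_0_131072.exr"[-10:-4]

# Illustrative input, taken from the filenames in the question.
list_of_filenames = [
    "Scene_1_Camera_131072.exr",
    "Scene_1_Camera_0_000001.exr",
    "Scene_1_Camera_000001.exr",
    "Scene_1_Camera_0_131072.exr",
]

# Sort by prefix first, then by sample number.
list_of_filenames.sort(key=slice_key)
```

Because the key is computed once per filename, this stays cheap for 32000 entries, but it breaks if any sample number has fewer than six digits.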

Upvotes: 0
