d3pd

Reputation: 8315

How to identify files that have increasing numbers and a similar form of filename?

I have a directory of files, some of them image files. Some of those image files are a sequence of images. They could be named image-000001.png, image-000002.png and so on, or perhaps 001_sequence.png, 002_sequence.png etc.

How can we identify images that would, to a human, appear by their names to be fairly obviously in a sequence? This would mean identifying only those image filenames that have increasing numbers and all have a similar form of filename.

The similar part of the filename would not be pre-defined.

Upvotes: 2

Views: 550

Answers (2)

Delgan

Reputation: 19627

I would suggest running a regex over the file names and grouping each matching pattern with the list of numbers extracted from those names.

Once this is done, loop through the dictionary's items and check that the number of elements equals the size of the range spanned by the matched numbers.

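import re
from collections import defaultdict
from os import listdir

files = listdir("/the/path/")

# Maps "prefix___suffix" to the list of numbers found between them.
found_patterns = defaultdict(list)
# Raw string so that \d and \. reach the regex engine unchanged.
p = re.compile(r"(.*?)(\d+)(.*)\.png")

for f in files:
    s = p.match(f)
    if s:
        # The first run of digits is the counter; the rest is the pattern.
        pattern = s.group(1) + "___" + s.group(3)
        num = int(s.group(2))
        found_patterns[pattern].append(num)

for pattern, found in found_patterns.items():
    mini, maxi = min(found), max(found)
    # Require at least two files and every number from mini to maxi exactly once.
    if len(found) > 1 and sorted(found) == list(range(mini, maxi + 1)):
        print("Pattern correct: %s" % pattern)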

Of course, this will not work if some numbers in the sequence are missing, but you could allow a tolerance for a few gaps.
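A minimal sketch of such a tolerance, assuming it is expressed as a maximum count of missing numbers. The function name is_sequence and the parameter max_missing are hypothetical and only for illustration; they are not part of the code above.

```python
def is_sequence(numbers, max_missing=0):
    """Return True if numbers form a run with at most max_missing gaps.

    Hypothetical helper: max_missing is an assumed way to express the
    tolerance; a ratio of missing to present numbers would work too.
    """
    unique = set(numbers)
    # A single file is not a sequence, and duplicate numbers
    # (e.g. from "img-1.png" and "img-01.png") are treated as suspicious.
    if len(unique) < 2 or len(unique) != len(numbers):
        return False
    span = max(unique) - min(unique) + 1
    # span - len(unique) is how many numbers are absent from the range.
    return span - len(unique) <= max_missing
```

With this, the final check in the loop could become something like is_sequence(found, max_missing=2) instead of the strict length comparison.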

Upvotes: 1

tobias_k

Reputation: 82919

You can use a regular expression to get files adhering to a certain pattern, e.g. .*\d+.*\.(jpg|png) for anything, then a number, then more anything, and an image extension.

files = ["image-000001.png", "image-000002.png", "001_sequence.png", 
         "002_sequence.png", "not an image 1.doc", "not an image 2.doc", 
         "other stuff.txt", "singular image.jpg"]

import re
image_files = [f for f in files if re.match(r".*\d+.*\.(jpg|png)", f)]

Now, group those image files by replacing the number with some generic string, e.g. XXX:

import collections

patterns = collections.defaultdict(list)
for f in image_files:
    # Replace every run of digits with a placeholder to get the pattern.
    p = re.sub(r"\d+", "XXX", f)
    patterns[p].append(f)

As a result, patterns is

{'image-XXX.png': ['image-000001.png', 'image-000002.png'], 
 'XXX_sequence.png': ['001_sequence.png', '002_sequence.png']}

Similarly, it should not be too hard to check whether all those numbers are consecutive, but maybe that's not really necessary after all. Note, however, that this will have problems discriminating numbered series such as "series1_001.jpg" and "series2_001.jpg": because every run of digits is replaced, both collapse to the same pattern, "seriesXXX_XXX.jpg".
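One possible way around both points, as a rough sketch rather than a definitive solution: replace only one run of digits at a time, so that each numeric field is tried as the counter, and keep only groups whose numbers are consecutive. The function name find_sequences is hypothetical, and the sketch assumes the list has already been filtered to image files whose extensions contain no digits.

```python
import re
from collections import defaultdict

def find_sequences(filenames):
    """Group names that differ in exactly one numeric field.

    Hypothetical helper. Returns {pattern: [names in numeric order]}
    for groups of two or more names with consecutive numbers.
    """
    groups = defaultdict(list)
    for name in filenames:
        # Try each run of digits in turn as the candidate counter.
        for m in re.finditer(r"\d+", name):
            key = name[:m.start()] + "XXX" + name[m.end():]
            groups[key].append((int(m.group()), name))

    sequences = {}
    for key, members in groups.items():
        members.sort()
        numbers = [n for n, _ in members]
        expected = list(range(numbers[0], numbers[0] + len(numbers)))
        # Keep only groups of at least two with no gaps or duplicates.
        if len(numbers) >= 2 and numbers == expected:
            sequences[key] = [name for _, name in members]
    return sequences
```

For the two series mentioned above, this keeps "series1_XXX.jpg" and "series2_XXX.jpg" as separate groups. It may also report "seriesXXX_001.jpg" as a sequence across the two series, so some ambiguity remains and would need a further rule, such as preferring the longest group.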

Upvotes: 1
