user2908926

Reputation: 205

Filter list of strings by date order where date is part of string

I build a list of all the files in a directory using os.listdir('path'). The filenames follow the format xxxx_2019-05-20.txt.

I would like to create a second list containing only the files dated later than 2019-01-01.

Is there a way of doing this without iterating through each filename and extracting the date from the filename and comparing it against the filterdate (2019-01-01)?

I can do the above; the only problem is that I may be looking at very large directories, so I was wondering whether there's a smarter way to do this. Thanks for the help.

Upvotes: 1

Views: 55

Answers (1)

vurmux

Reputation: 10020

I don't think time will be a problem here. I built a test with one million fake filenames, and it runs in about 2.5 seconds for me (on an average computer). Moreover, I use a regular expression to extract the year, so a simpler solution would be even faster.

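import timeit

# Raw string so the regex backslashes reach the setup code unchanged.
s = r"""from random import choice
import re

names = ('WAKA', 'waka', 'waka-waka', 'wattafak')
dates = ('2018-12-01', '2018-01-01', '2019-01-01', '2019-02-03')

# Lazy generator of one million fake filenames; it can be consumed only once.
filenames = (
    choice(names) + '_' + choice(dates) + '.txt'
    for _ in range(1000000)
)

def check_filenames_regex(filenames):
    # Capture the year from names shaped like <prefix>_YYYY-MM-DD.<ext>.
    REGEX = re.compile(r'.*_(?P<year>\d{4})-\d\d-\d\d\..+')
    result = []
    for f in filenames:
        r = REGEX.match(f)
        if r:
            year = r.group('year')
            # Year-only comparison: this also keeps files dated exactly 2019-01-01.
            if int(year) >= 2019:
                result.append(f)
    return result
"""

# number=1: the generator is exhausted after one pass, so the default of
# 1,000,000 repetitions would mostly time calls on an empty generator.
timeit.timeit('check_filenames_regex(filenames)', setup=s, number=1)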

returns:

2.742631300352514

Unless your folders hold tens of millions of files, a simple brute-force solution should not be a problem.
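As for the simpler solution mentioned above, here is a minimal sketch that skips the regex entirely. It assumes every filename ends in _YYYY-MM-DD followed by an extension, as in the question; the function name files_after and the cutoff parameter are just names I made up for illustration. Because ISO dates (YYYY-MM-DD) sort the same way as plain strings, the date part can be compared directly without parsing it.

```python
import os

def files_after(filenames, cutoff="2019-01-01"):
    """Return the names whose trailing _YYYY-MM-DD date is strictly later than cutoff.

    Assumes names look like <prefix>_YYYY-MM-DD.<ext>; no real date validation is done.
    """
    result = []
    for name in filenames:
        stem, _ext = os.path.splitext(name)      # 'xxxx_2019-05-20'
        _prefix, sep, date = stem.rpartition("_")  # date is the text after the last '_'
        # ISO dates compare correctly as strings, so no datetime parsing is needed.
        if sep and len(date) == 10 and date > cutoff:
            result.append(name)
    return result
```

You would call it as files_after(os.listdir('path')). It is still one pass over the list, which cannot really be avoided since the date only exists inside the name, but each step is just a split and a string comparison.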

Upvotes: 2
