Reputation: 205
I create a list of filenames in the format xxxx_2019-05-20.txt for all files in a directory, using os.listdir('path') to build the list.
I would like to create a second list containing only the files dated later than 2019-01-01.
Is there a way to do this without iterating through each filename, extracting the date from the filename, and comparing it against the filter date (2019-01-01)?
I can do the above (roughly as sketched below); the only problem is that I may be looking at very large directories, so I was wondering whether there is a smarter way to do this. Thanks for the help.
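Here is a minimal sketch of the per-filename approach I have in mind, assuming the date is always the part between the last underscore and the .txt extension:

import os
from datetime import date, datetime

filter_date = date(2019, 1, 1)

recent = []
for name in os.listdir('path'):
    # take the 'YYYY-MM-DD' part between the last '_' and the extension
    date_part = os.path.splitext(name)[0].rsplit('_', 1)[-1]
    try:
        file_date = datetime.strptime(date_part, '%Y-%m-%d').date()
    except ValueError:
        continue  # skip names that don't follow the expected pattern
    if file_date > filter_date:
        recent.append(name)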
Upvotes: 1
Views: 55
Reputation: 10020
I don't think time will be a problem here. I constructed a benchmark with one million fake filenames and it runs in about 2.7 seconds for me (on an average computer). Moreover, I use a regular expression for the year extraction, so a simpler solution (see the sketch after the benchmark) would be even faster.
import timeit

s = """from random import choice
import re

names = ('WAKA', 'waka', 'waka-waka', 'wattafak')
dates = ('2018-12-01', '2018-01-01', '2019-01-01', '2019-02-03')

# one million fake filenames in the xxxx_YYYY-MM-DD.txt format
filenames = (
    choice(names) + '_' + choice(dates) + '.txt'
    for _ in range(1000000)
)

def check_filenames_regex(filenames):
    # keep only the names whose embedded year is 2019 or later
    REGEX = re.compile(r'.*_(?P<year>\d{4})-\d\d-\d\d\..+')
    result = []
    for f in filenames:
        r = REGEX.match(f)
        if r:
            year = r.group('year')
            if int(year) >= 2019:
                result.append(f)
    return result
"""

timeit.timeit('check_filenames_regex(filenames)', setup=s)
returns:
2.742631300352514
Unless your folders contain tens of millions of files, a simple brute-force solution should not be a problem.
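For completeness, here is a minimal sketch of the simpler non-regex variant mentioned above. It assumes every name ends with the fixed-length suffix _YYYY-MM-DD.txt, so the date can be sliced out directly; ISO dates compare correctly as plain strings, so no parsing is needed:

import os

def check_filenames_slice(filenames, cutoff='2019-01-01'):
    # name[-14:-4] is the 'YYYY-MM-DD' part of '..._YYYY-MM-DD.txt';
    # ISO dates sort lexicographically, so string comparison is enough
    return [f for f in filenames if f[-14:-4] > cutoff]

recent = check_filenames_slice(os.listdir('path'))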
Upvotes: 2