bbal20

Reputation: 133

Regex find match within number range

I have a series of files with the following naming convention: "2020.01.01 W1 Forecast.xlsm". I am trying to loop through a directory and match file names for the year 2020 and later, or at least for a broader range (e.g. 2020-2030), so I don't have to alter my script every year. I've tried the following, but I can't get the pattern to match anything other than the current year, 2020. The file names always start with the year.

from pathlib import Path

path_str = '/Users/X/Desktop/Test_Directory/'

pattern_str = '*2020.*Forecast.xlsm'

p = Path(path_str)
files = p.rglob(pattern_str)

for file in files:
    print(file)

sample output:

/Users/X/Desktop/Test_Directory/2020.08.03 Week 32 Forecast.xlsm
/Users/X/Desktop/Test_Directory/2020.01.06 Week 2 Forecast.xlsm
/Users/X/Desktop/Test_Directory/2020.06.18 Week 25 Forecast.xlsm
/Users/X/Desktop/Test_Directory/2020.06.22 Week 26 Forecast.xlsm

Any help or direction is greatly appreciated.

Upvotes: 0

Views: 993

Answers (3)

mosc9575

Reputation: 6367

I am not sure how far you want to go, but if your goal is only to identify a year in the range 2020-2030, then this is the regular expression for your complete path: ^.*20(2\d|30).*$.

Because you are working with a path, I would suggest splitting the string on the last slash / and applying the regular expression to the last item of the list. Then you can write the regular expression for the file name only.

Maybe this will help:

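import re

for file in files:
    # file is a pathlib.Path, so file.name is the part after the last slash
    my_string = file.name
    # re.match starts at the beginning of the string; the extension is .xlsm
    match = re.match(r'^20(2\d|30).*\.xlsm$', my_string)
    if match:
        print(file)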

Maybe try yourself with this tool.

I also want to add some more information about regular expressions, so you can understand what is going on.

  1. ^ - This matches the start of the string. This is the reason why some answers were not successful so far.

  2. . - This matches any single character, so you can easily skip over uninteresting parts. But be careful: because of this, a literal dot has to be escaped like this: \.

  3. $ - This matches the end of the string.

  4. \d - This is shorthand for a digit and matches [0-9].

  5. * - This is a greedy quantifier. It tries to match the preceding item zero or more times, as many times as possible. Examples:

    a. .* - This tries to match as many characters as possible, of any type.

    b. \d* - This tries to match as many digits as possible.

  6. + - This is also a greedy quantifier, but it has to match at least once.
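
The points in the list above can be checked directly with Python's re module. This is only a small sketch; the variable names and sample strings are made up for illustration.

```python
import re

name = "2020.01.01 W1 Forecast.xlsm"

# ^ anchors the pattern to the start of the string.
anchored = re.search(r"^2020", name) is not None
anchored_copy = re.search(r"^2020", "Copy of " + name) is not None

# An unescaped . matches any character; \. matches only a literal dot.
any_char = re.search(r"^2020.01", "2020-01") is not None
literal_dot = re.search(r"^2020\.01", "2020-01") is not None

# \d* may match zero digits, while \d+ needs at least one.
star = re.match(r"\d*", "W1")
plus = re.match(r"\d+", "W1")

# Both quantifiers are greedy: they take as many digits as they can.
greedy = re.match(r"\d+", name).group()

print(anchored, anchored_copy)   # True False
print(any_char, literal_dot)     # True False
print(repr(star.group()), plus)  # '' None
print(greedy)                    # 2020
```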

Upvotes: 0

Talon

Reputation: 1894

In your second pattern, you're missing a . wildcard after the year. You probably want

^(202[0-9]|2030).*Forecast\.xlsm

rather than

^(202[0-9]|2030)*Forecast.xlsm

You can use a site like https://regexr.com/ to experiment with regexes.
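
If you prefer to check this in Python rather than on a website, here is a quick sketch comparing the two patterns against one of your file names (the variable names are mine, not from your script):

```python
import re

name = "2020.01.01 W1 Forecast.xlsm"

# Without the . wildcard, * repeats the year group itself, so nothing
# can consume ".01.01 W1 " between the year and "Forecast".
broken = r"^(202[0-9]|2030)*Forecast.xlsm"
# With .* after the year group, the middle of the name is skipped.
fixed = r"^(202[0-9]|2030).*Forecast\.xlsm"

broken_match = re.match(broken, name)
fixed_match = re.match(fixed, name)
out_of_range = re.match(fixed, "2031.01.06 Week 2 Forecast.xlsm")

print(broken_match)             # None
print(fixed_match is not None)  # True
print(out_of_range)             # None
```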

But you might want to consider selecting the most recent files with programming logic instead of regexes: you could parse the file name and select a date range, e.g. using datetime.


Update

Starting from your updated code:

import datetime
import re  # only needed if you uncomment the check below
from pathlib import Path

path_str = '/Users/X/Desktop/Test_Directory/'
pattern_str = '*Forecast.xlsm'  # All your report files

p = Path(path_str)
files = p.rglob(pattern_str)
current_year = datetime.datetime.today().year

for file in files:
    # # uncomment in case there are different patterns in that folder:
    # if not re.match(r"\d{4}\.\d{2}\.\d{2}.*", file.name): continue
    # The first 10 characters of the name are the date, e.g. 2020.01.06
    date = datetime.datetime.strptime(file.name[:10], "%Y.%m.%d")
    # >= so that a file dated January 1 is included as well
    if date >= datetime.datetime(current_year, 1, 1):
        print(date)

This will filter your list of files down to those dated in the current year or later.
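
If you want a fixed range such as 2020-2030 instead, the same date parsing can be wrapped in a small helper. This is just a sketch: in_year_range and its default bounds are names and values I made up for illustration, and it assumes every report name starts with a YYYY.MM.DD date.

```python
import datetime


def in_year_range(file_name, first_year=2020, last_year=2030):
    """Return True if file_name starts with a YYYY.MM.DD date in the range."""
    try:
        date = datetime.datetime.strptime(file_name[:10], "%Y.%m.%d")
    except ValueError:
        # The name does not start with a valid date, so skip it.
        return False
    return first_year <= date.year <= last_year


print(in_year_range("2020.01.06 Week 2 Forecast.xlsm"))  # True
print(in_year_range("2031.01.06 Week 2 Forecast.xlsm"))  # False
print(in_year_range("Notes Forecast.xlsm"))              # False
```

Inside the loop you would then call in_year_range(file.name) instead of comparing against the current year.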

Upvotes: 0

pheeper

Reputation: 1527

Here's what you're looking for: '^20[2-9][0-9].+(\.xlsm)$'

It says: start with 2020 through 2099, followed by any character (.) one or more times (+), and end with .xlsm (\.xlsm)$. Note the backslash in the last part. It is required to escape the period; otherwise it will be interpreted as any character.
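
One thing to keep in mind: Path.rglob takes a glob pattern, not a regular expression, so the regex has to be applied to each file name separately. Below is a rough sketch of how that could look. The matching_files helper and the temporary test directory are only for illustration; in your script you would pass your own path_str.

```python
import re
import tempfile
from pathlib import Path

# The pattern from above, compiled once.
year_pattern = re.compile(r"^20[2-9][0-9].+(\.xlsm)$")


def matching_files(directory):
    """Return sorted names of .xlsm files whose name matches year_pattern."""
    candidates = Path(directory).rglob("*.xlsm")  # glob syntax, not regex
    return sorted(f.name for f in candidates if year_pattern.match(f.name))


# Build a throwaway directory with a few empty sample files.
with tempfile.TemporaryDirectory() as tmp:
    for sample in (
        "2020.01.06 Week 2 Forecast.xlsm",
        "2021.01.04 Week 1 Forecast.xlsm",
        "2019.12.30 Week 1 Forecast.xlsm",
    ):
        (Path(tmp) / sample).touch()
    found = matching_files(tmp)

print(found)
```

The 2019 file is left out because its name does not start with 2020 through 2099.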

Upvotes: 1
