OAK

Reputation: 3176

regular expression python - catch filenames

I'm trying to pick out only those files in a given folder whose names follow a specific pattern. I need help developing a regular expression that matches two naming patterns; I can't seem to find one that matches both. This is the original regular expression I used:

r"^([a-zA-Z]+)__?(\d+).(\d+).(\d+)\.xlsx"

The reason for this search pattern is that I then extract the name, the date parts (day, month, year) and the full filename into five variables. This lets me read the date in the filename, which is the input date of the file.

for name, day, month, year, fullfilename in files:

Now I am trying the following:

import os
import re

files = []
for f in os.listdir(drive):  # drive: path of the folder to scan
    match = re.search(r"^([a-zA-Z-]+)__?(\d+).(\d+).(\d+).xlsx$",f)
    if match:
        files.append(match.groups() + (f,))

Sample filenames:

filename_19.01.17.xlsx
filename__04.01.17.xlsx
AB_TEST_DATA-OUTER_13.02.17.xlsx

So the extraction should be the following:

filename, 19, 01, 17, filename_19.01.17.xlsx

Also tried the following:

r"^(([a-zA-Z-]+)(__?)){1,3}(\d+).(\d+).(\d+).xlsx"

Is it possible to have one pattern that matches all of these files? Or should I split it into two patterns?

Upvotes: 1

Views: 1679

Answers (2)

Jan

Reputation: 43199

You could go for:

^.+__?(\d{2})\.(\d{2})\.(\d{2})\.xlsx$

Broken down this means:

^         # start of the string
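.+        # any characters, matched greedily, then backtracking as needed
__?       # one or two underscores
(\d{2})\. # exactly two digits (the day), followed by a literal dot
(\d{2})\. # two digits (the month), followed by a literal dot
(\d{2})\. # two digits (the year), followed by a literal dot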
xlsx      # "xlsx" literally
$         # the end

See a demo on regex101.com. Additionally, have a look at glob().
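The pattern above captures the three date parts but not the name the question asks for. A minimal sketch of one way to get it in Python, assuming the sample filenames from the question: the extra lazy group `(.+?)` around the leading part and the helper name `parse_filename` are illustrative additions, not part of the answer.

```python
import re

# The answer's pattern with one assumed change: the leading ".+" becomes a
# lazy capture group "(.+?)" so the name is captured without a trailing underscore.
PATTERN = re.compile(r"^(.+?)__?(\d{2})\.(\d{2})\.(\d{2})\.xlsx$")


def parse_filename(filename):
    """Return (name, day, month, year, filename), or None if there is no match."""
    match = PATTERN.match(filename)
    if match is None:
        return None
    return match.groups() + (filename,)


samples = [
    "filename_19.01.17.xlsx",
    "filename__04.01.17.xlsx",
    "AB_TEST_DATA-OUTER_13.02.17.xlsx",
]

for sample in samples:
    print(parse_filename(sample))
```

The group is lazy because a greedy `(.+)` would swallow the first of two underscores in `filename__04.01.17.xlsx` and return `filename_` as the name.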

Upvotes: 1

ZdaR

Reputation: 22974

The pattern here seems to be:

First some non-whitespace characters, followed by one or more underscores, then a date in the format xx.xx.xx, and the .xlsx extension at the end. This can be translated to a regex as:

\S+_+(\d+\.){3}xlsx

Break-Up:

\S+ - matches any non-whitespace character, one or multiple times.

_+ - matches the underscore character one or multiple times.

(\d+\.){3} - three runs of digits, each followed by a literal dot: the date xx.xx.xx plus the dot before the extension.

xlsx - matches the extension of the file.
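One thing to watch when using a `{3}` repetition in Python: a repeated capture group keeps only its last repetition, so day, month and year cannot be read back individually. The sketch below compares this pattern, with every literal dot escaped, against an unrolled variant; the unrolled pattern and the variable names are assumptions for illustration, not part of the answer.

```python
import re

# The answer's pattern with every literal dot escaped.
repeated = re.compile(r"\S+_+(\d+\.){3}xlsx")

# Hypothetical unrolled variant: one group per field, with a lazy name group.
unrolled = re.compile(r"(\S+?)_+(\d+)\.(\d+)\.(\d+)\.xlsx")

filename = "AB_TEST_DATA-OUTER_13.02.17.xlsx"

# Only the last repetition of the repeated group survives.
print(repeated.fullmatch(filename).groups())   # ('17.',)

# Separate groups give name, day, month and year.
print(unrolled.fullmatch(filename).groups())   # ('AB_TEST_DATA-OUTER', '13', '02', '17')
```

The repeated form is fine for filtering filenames; the unrolled form is the one to reach for when the fields themselves are needed.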

Upvotes: 1
