W. Kessler

Reputation: 447

How to use regex to match on consecutive uppercase letters of unknown length

Ultimately I am trying to retrieve substrings from file paths which match a particular pattern of characters and numbers. But at this point I'm just trying to get my pattern matching correct. The issue is that I don't always know how many characters are in parts of the matching strings.

For example, the broad pattern is: all uppercase letters, followed by an underscore, a 4-digit number, another underscore, and then a mix of upper/lowercase letters and digits.

filepath/here/of/unknown/length/THISISTHEPATTERN_1111_Iwanttomatch/......

So far, I have tried variations on re.compile(r'\b[A-Z]+_\d{4}_[a-zA-Z][0-9]*'). However, this still matches cases where the uppercase part of the pattern is interspersed with lowercase characters.

Here is a test case:

import re

file_path = [
"C:/Users/csadf/asfsf/Test",
"BEAVERHEAD_2020_thisisatestQL1",
"BEAVERHEAD_2020_thisisatest",
"BEAVeRHEAD_2020_thisisatest",
"C:/Users/csadf/asfsf/Test/BEAVERHEAD_2020_thisisaTest1",
"C:/Users/csadf/asfsf/Test/BEAVERHEAD_202_thisisaTest1",
"C:/Users/csadf/asfsf/Test/BEAVERHEAD_2020_thisisaTest1/data/dfjt/rth"]

projectname_pattern = re.compile(r'\b[A-Z]+[a-z]{0}[A-Z]+_\d{4}_[a-zA-Z][0-9]*')


for i in file_path:
    if projectname_pattern.search(i):
        print("Found a match: ", i)
    else:
        print("no match")

In this example I want it to return:

no match
Found a match
Found a match
no match
Found a match
no match
Found a match 

Edit: corrected a typo in the test cases. Also, it appears I unwittingly answered my own question: the regex above appears to work.

Upvotes: 1

Views: 217

Answers (1)

KateYoak

Reputation: 1731

The test output shows that your regex works exactly as your stated pattern requires:

no match:  C:/Users/csadf/asfsf/Test
Found a match:  BEAVERHEAD_2020_thisisatestQL1
Found a match:  BEAVERHEAD_2020_thisisatest
no match:  BEAVeRHEAD_2020_thisisatest
no match:  C:/Users/csadf/asfsf/TestBEAVERHEAD_2020_thisisaTest1
no match:  C:/Users/csadf/asfsf/Test/BEAVERHEAD_202_thisisaTest1
Found a match:  C:/Users/csadf/asfsf/Test/BEAVERHEAD_2020_thisisaTest1/data/dfjt/rth

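The case of C:/Users/csadf/asfsf/TestBEAVERHEAD_2020_thisisaTest1 is not matched because there is no word boundary (\b) before BEAVERHEAD: the preceding t and the B are both word characters, and \b only matches between a word character and a non-word character (or at the start or end of the string).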
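To see the word-boundary behaviour in isolation, here is a small sketch that runs the regex from the question against the two variants of that path. The variable names glued and separated are just illustrative labels.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r'\b[A-Z]+[a-z]{0}[A-Z]+_\d{4}_[a-zA-Z][0-9]*')

glued = "C:/Users/csadf/asfsf/TestBEAVERHEAD_2020_thisisaTest1"
separated = "C:/Users/csadf/asfsf/Test/BEAVERHEAD_2020_thisisaTest1"

# 't' and 'B' are both word characters, so \b cannot match between them.
print(pattern.search(glued))              # None

# '/' is a non-word character, so \b matches right before 'B'.
print(pattern.search(separated).group())  # BEAVERHEAD_2020_t
```

Two side notes: [a-z]{0} matches zero characters, so it has no effect, and [A-Z]+[a-z]{0}[A-Z]+ simply means "two or more uppercase letters". Also, the tail [a-zA-Z][0-9]* matches one letter followed by optional digits, which is why the matched text stops at BEAVERHEAD_2020_t. That is enough for a yes/no check with search(), but not for extracting the full name.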

Can you double-check your requirements and restate?
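In case the actual requirement is "the project name must be a whole path component", here is one possible sketch rather than a definitive answer. It assumes forward-slash separators and that the last part is any mix of letters and digits. The names PROJECT_NAME and extract_project_name are made up for this example.

```python
import re

# Assumption: the name is a full path component, i.e. it starts at the
# beginning of the string or right after '/', and ends at '/' or the end.
PROJECT_NAME = re.compile(r'(?:^|/)([A-Z]+_\d{4}_[A-Za-z0-9]+)(?=/|$)')

def extract_project_name(path):
    """Return the project-name component of path, or None if there is none."""
    match = PROJECT_NAME.search(path)
    return match.group(1) if match else None

file_path = [
    "C:/Users/csadf/asfsf/Test",
    "BEAVERHEAD_2020_thisisatestQL1",
    "BEAVERHEAD_2020_thisisatest",
    "BEAVeRHEAD_2020_thisisatest",
    "C:/Users/csadf/asfsf/Test/BEAVERHEAD_2020_thisisaTest1",
    "C:/Users/csadf/asfsf/Test/BEAVERHEAD_202_thisisaTest1",
    "C:/Users/csadf/asfsf/Test/BEAVERHEAD_2020_thisisaTest1/data/dfjt/rth",
]

for path in file_path:
    print(extract_project_name(path))
```

With these assumptions the match/no-match results line up with the expected output in the question, and group(1) gives the whole name (for example BEAVERHEAD_2020_thisisaTest1) instead of a truncated match. Paths using backslashes would need the separator in the pattern adjusted.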

Upvotes: 4
