Reputation: 447
Ultimately I am trying to retrieve substrings from file paths which match a particular pattern of characters and numbers. But at this point I'm just trying to get my pattern matching correct. The issue is that I don't always know how many characters are in parts of the matching strings.
For example, the broad pattern is: all-uppercase letters, an underscore, a 4-digit number, another underscore, and then a mix of uppercase/lowercase letters and digits.
filepath/here/of/unknown/length/THISISTHEPATTERN_1111_Iwanttomatch/......
So far, I have tried variations on re.compile(r'\b[A-Z]+_\d{4}_[a-zA-Z][0-9]*')
However, this still captures cases where the uppercase part of the pattern is interspersed with lowercase characters.
Here is a test case:
import re
file_path = [
"C:/Users/csadf/asfsf/Test",
"BEAVERHEAD_2020_thisisatestQL1",
"BEAVERHEAD_2020_thisisatest",
"BEAVeRHEAD_2020_thisisatest",
"C:/Users/csadf/asfsf/Test/BEAVERHEAD_2020_thisisaTest1",
"C:/Users/csadf/asfsf/Test/BEAVERHEAD_202_thisisaTest1",
"C:/Users/csadf/asfsf/Test/BEAVERHEAD_2020_thisisaTest1/data/dfjt/rth"]
projectname_pattern = re.compile(r'\b[A-Z]+[a-z]{0}[A-Z]+_\d{4}_[a-zA-Z][0-9]*')
for i in file_path:
    if projectname_pattern.search(i):
        print("Found a match: ", i)
    else:
        print("no match")
In this example I want it to return:
no match
Found a match
Found a match
no match
Found a match
no match
Found a match
Edit: corrected a typo in the test cases. Also, it appears I unwittingly answered my own question: the above regex appears to work.
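For the eventual goal of retrieving the substring, here is a minimal sketch. It rests on assumptions that are not in the original post: that the third part should run up to the next path separator, so the tail `[a-zA-Z][0-9]*` (one letter, then optional digits) is swapped for `[A-Za-z0-9]+`, and that `[a-z]{0}` can be dropped because it only ever matches the empty string. Dropping it also means a single uppercase letter before the first underscore is now accepted. The name `extract_project_name` is hypothetical.

```python
import re

# Assumption: the third part is a run of letters and digits ending at the
# next non-alphanumeric character (for example a path separator).
PROJECT_NAME = re.compile(r'\b[A-Z]+_\d{4}_[A-Za-z0-9]+')


def extract_project_name(path):
    """Return the first substring of path that fits the pattern, or None."""
    match = PROJECT_NAME.search(path)
    return match.group(0) if match else None


print(extract_project_name(
    "C:/Users/csadf/asfsf/Test/BEAVERHEAD_2020_thisisaTest1/data/dfjt/rth"))
print(extract_project_name("BEAVeRHEAD_2020_thisisatest"))
```

With the test paths above, the first call should yield the BEAVERHEAD_2020_thisisaTest1 component and the second should yield None, since the lowercase e interrupts the uppercase run and there is no word boundary later in that name to restart from.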
Upvotes: 1
Views: 217
Reputation: 1731
The test output shows that your regex is working exactly as your stated pattern requires:
no match: C:/Users/csadf/asfsf/Test
Found a match: BEAVERHEAD_2020_thisisatestQL1
Found a match: BEAVERHEAD_2020_thisisatest
no match: BEAVeRHEAD_2020_thisisatest
no match: C:/Users/csadf/asfsf/TestBEAVERHEAD_2020_thisisaTest1
no match: C:/Users/csadf/asfsf/Test/BEAVERHEAD_202_thisisaTest1
Found a match: C:/Users/csadf/asfsf/Test/BEAVERHEAD_2020_thisisaTest1/data/dfjt/rth
The path C:/Users/csadf/asfsf/TestBEAVERHEAD_2020_thisisaTest1 is not matched because there is no word boundary (\b) before BEAVERHEAD: the preceding character, t, is also a word character, so \b cannot match between them.
Can you double-check your requirements and restate them?
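A small sketch of the word-boundary behaviour, using the regex from the question unchanged. The variable names `with_slash` and `without_slash` are just illustrative labels for the two versions of the path.

```python
import re

pattern = re.compile(r'\b[A-Z]+[a-z]{0}[A-Z]+_\d{4}_[a-zA-Z][0-9]*')

with_slash = "C:/Users/csadf/asfsf/Test/BEAVERHEAD_2020_thisisaTest1"
without_slash = "C:/Users/csadf/asfsf/TestBEAVERHEAD_2020_thisisaTest1"

# "/" is not a word character, so \b matches between "/" and "B".
print(bool(pattern.search(with_slash)))     # True

# "t" and "B" are both word characters, so there is no boundary before
# "BEAVERHEAD"; the only boundary in that component is before "Test".
print(bool(pattern.search(without_slash)))  # False
```

If a project name glued directly onto a preceding folder name should still count, the `\b` would have to be removed or replaced, which is a change to the requirements rather than a bug in the pattern.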
Upvotes: 4