Reputation: 3068
I need to match patterns like the following: AAXX#
Where:
* AA is from a set (i.e. a list) of 1-3 char alpha prefixes,
* XX is from a different list of pre-defined strings, and
* any single-digit numeral follows.
AA strings: ['bo','h','fr','sam','pe']
XX strings: cl + ['x','n','r','nr','eaner'] //OR ELSE JUST// ro
Desired Result: bool indicating whether any of the possible combos match the provided string.
Sample Test Strings:
item = "boro1" - that is, bo + ro + 1
item = "samcl2" - i.e. sam + cl + 2
item = "hcln3" - i.e. h + cln + 3
The best I can figure is to use a loop, but I am having trouble with the essential regex. It works for the single-letter optionals cln, clx, clr, but not for the longer ones clnr, cleaner.
import re

item = "hclnr2" #h + clnr + 2
out = False
arr = ['bo','h','fr','sam','pe']
for mnrl in arr:
    myrx = re.escape(mnrl) + r'cl[x|n|r|nr|eaner]\d'
    thisone = bool(re.search(myrx, item))
    print('mnrl: '+mnrl+' - ', thisone)
    if thisone: out = True
##########################################################################
# SKIP THIS - INCLUDED IN CASE S/O HAS A BETTER SOLUTION THAN A SECOND LOOP
# THE ABOVE FOR-LOOP handled THE CL[opts] TESTS, THIS LOOP DOES THE RO TESTS
##########################################################################
#if not out: #If not found a match amongst the "cl__" options, test for "ro"
#    for mnrl in arr:
#        myrx = re.escape(mnrl) + r'ro\d'
#        thisone = bool(re.search(myrx, item))
#        print('mnrl: '+mnrl+' - ', thisone)
#        if thisone: out = True
##########################################################################
print('result: ', out)
PRINTS:
mnrl: bo - False
mnrl: h - False <======
mnrl: fr - False
mnrl: sam - False
mnrl: pe - False
BUT changing item to:
item = "hcln2" #h + cln + 2
PRINTS:
mnrl: bo - False
mnrl: h - True <========
mnrl: fr - False
mnrl: sam - False
mnrl: pe - False
And ditto for item = hclr5 or item = hclx9, BUT NOT hcleaner9.
Upvotes: 0
Views: 194
Reputation: 10890
My approach would be
import re
words = ['boro1', 'samcl2', 'hcln3', 'boro1+unwantedstuff']
p = r'(bo|h|fr|sam|pe)(cl(x|n|r|nr|eaner|)|ro)\d$'
for w in words:
    print(re.match(p, w))
Result:
<_sre.SRE_Match object; span=(0, 5), match='boro1'>
<_sre.SRE_Match object; span=(0, 6), match='samcl2'>
<_sre.SRE_Match object; span=(0, 5), match='hcln3'>
None
For your desired boolean output you can simply cast the match object to 'bool'.
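Since the question says the prefixes come from a list, here is a minimal sketch of building the same kind of pattern from the lists and returning a bool. The names alternation and is_match are made up for illustration, and the empty string in cl_suffixes is an assumption taken from the "samcl2" sample (a bare "cl" with no suffix).

```python
import re

prefixes = ['bo', 'h', 'fr', 'sam', 'pe']
# Assumption: '' is added so that a bare "cl" (as in "samcl2") is accepted.
cl_suffixes = ['x', 'n', 'r', 'nr', 'eaner', '']

def alternation(options):
    """Build a non-capturing group (?:a|b|c) from a list of literal strings."""
    # Longest options first, so longer alternatives are tried before shorter ones.
    ordered = sorted(options, key=len, reverse=True)
    return '(?:' + '|'.join(re.escape(o) for o in ordered) + ')'

# prefix, then ("cl" + optional suffix, or "ro"), then one digit at the end
pattern = re.compile(
    alternation(prefixes) + '(?:cl' + alternation(cl_suffixes) + '|ro)' + r'\d$'
)

def is_match(item):
    """Return True if item has the form prefix + (cl[suffix] | ro) + digit."""
    return bool(pattern.match(item))

for item in ['boro1', 'samcl2', 'hcln3', 'hclnr2', 'hcleaner9', 'boro1+unwantedstuff']:
    print(item, is_match(item))
```

re.escape is not strictly needed for these particular strings, but it keeps the sketch safe if a list entry ever contains a regex metacharacter.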
Upvotes: 2
Reputation: 71598
Some of the misconceptions in your code include the usage of character classes (syntax: [ ... ]). A character class matches exactly one character out of those listed inside it (the exceptions being a few characters with special meaning, namely ^ and - when placed in specific positions). This means that:
[x|n|r|nr|eaner]
will match any one character among: x, |, n, r, e, a (duplicated characters are simply ignored).
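To make the difference concrete, here is a small sketch comparing the character class from the question with a group that uses alternation. The variable names are only for illustration.

```python
import re

item = "hclnr2"

# Character class: matches exactly ONE character out of x | n r e a
broken = r'hcl[x|n|r|nr|eaner]\d'
# Non-capturing group with alternation: matches one whole alternative
fixed = r'hcl(?:x|n|r|nr|eaner)\d'

broken_hit = bool(re.search(broken, item))    # class eats "n", then \d meets "r"
fixed_hit = bool(re.search(fixed, item))      # "nr" alternative, then the digit
pipe_hit = bool(re.search(broken, "hcl|2"))   # the literal | is part of the class

print(broken_hit, fixed_hit, pipe_hit)  # False True True
```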
I'm not entirely sure why you are doing all those intricate things like re.escape in your code, but I trust you can understand the snippet below and adapt it to your situation:
import re
def matchPattern(item, extract=False):
    result = re.match(r"(bo|h|fr|sam|pe)((?:cl(?:nr|eaner|[xnr]|))|ro)([0-9])$", item)
    if result:
        if extract:
            return (result.group(1), result.group(2), result.group(3))
        else:
            return True
    else:
        if extract:
            return ('','','')
        else:
            return False
I tweaked the def a little so that you get a boolean if you call, for example, matchPattern("boro1"), and if you want the substring components, you can call matchPattern("boro1", True) and you will get ('bo', 'ro', '1') as the result (or ('', '', '') if it doesn't match).
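As a variation on matchPattern, here is a sketch that uses named groups and re.fullmatch (available since Python 3.4). The names PATTERN, split_item, prefix, middle and digit are made up for this example. One difference worth knowing: with re.match and a trailing $, a string ending in a newline such as "boro1\n" still matches, whereas re.fullmatch rejects it.

```python
import re

# Same structure as the regex above, but with named groups and no trailing $
PATTERN = re.compile(
    r"(?P<prefix>bo|h|fr|sam|pe)"
    r"(?P<middle>cl(?:nr|eaner|[xnr]|)|ro)"
    r"(?P<digit>[0-9])"
)

def split_item(item):
    """Return the three components as a dict, or None if item does not match."""
    m = PATTERN.fullmatch(item)  # the whole string must match
    return m.groupdict() if m else None

print(split_item("boro1"))      # {'prefix': 'bo', 'middle': 'ro', 'digit': '1'}
print(split_item("hcleaner9"))  # {'prefix': 'h', 'middle': 'cleaner', 'digit': '9'}
print(split_item("boro1\n"))    # None
```

Returning None instead of a tuple of empty strings is a design choice of this sketch, so the result can be used directly in an if test.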
As for the regex itself, you can test it on regex101.com.
You need to use groups if you want to limit what the | regex operator applies to. In the regex I use above,
* (bo|h|fr|sam|pe) means any one of bo, h, fr, sam or pe
* ((?:cl(?:nr|eaner|[xnr]|))|ro) means either (?:cl(?:nr|eaner|[xnr]|)) (this means cl followed by either nr, eaner, x, n, r or nothing) or ro
* ([0-9]) means a single digit (I prefer this to \d for minor additional performance)
Upvotes: 2