crashwap

Reputation: 3068

Python regex to detect one of multiple optional substrings following a string

I need to match patterns like the following: AAXX#

Where:
* AA is from a set (i.e. a list) of 1-3 char alpha prefixes,
* XX is from a different list of pre-defined strings, and
* any single-digit numeral follows.

AA strings: ['bo','h','fr','sam','pe']

XX strings: cl + optionally one of ['x','n','r','nr','eaner'] //OR ELSE JUST// ro

Desired Result: bool indicating whether any of the possible combos match the provided string.

Sample Test Strings:
item = "boro1" - i.e. bo + ro + 1
item = "samcl2" - i.e. sam + cl + 2
item = "hcln3" - i.e. h + cln + 3

The best I can figure is to use a loop, but I am having trouble with the essential regex. It works for the single-letter optionals cln, clx, clr, but not for the longer ones clnr, cleaner.

Code:

import re

item = "hclnr2" #h + clnr + 2
out = False
arr = ['bo','h','fr','sam','pe']
for mnrl in arr:
    myrx = re.escape(mnrl) + r'cl[x|n|r|nr|eaner]\d'
    thisone = bool(re.search(myrx, item))
    print('mnrl: '+mnrl+' - ', thisone)
    if thisone: out = True

##########################################################################
# SKIP THIS - INCLUDED IN CASE S/O HAS A BETTER SOLUTION THAN A SECOND LOOP
# THE ABOVE FOR-LOOP handled THE CL[opts] TESTS, THIS LOOP DOES THE RO TESTS
##########################################################################
#if not out: #If not found a match amongst the "cl__" options, test for "ro"
#    for mnrl in arr:
#        myrx = re.escape(mnrl) + r'ro\d'
#        thisone = bool(re.search(myrx, item))
#        print('mnrl: '+mnrl+' - ', thisone)
#    if thisone: out = True
##########################################################################

print('result: ', out)

PRINTS:

mnrl: bo - False
mnrl: h - False <======
mnrl: fr - False
mnrl: sam - False
mnrl: pe - False

However, changing item to:

item = "hcln2" #h + cln + 2

PRINTS:
mnrl: bo - False
mnrl: h - True <========
mnrl: fr - False
mnrl: sam - False
mnrl: pe - False

And ditto for item = hclr5 or item = hclx9 BUT NOT hcleaner9

Upvotes: 0

Views: 194

Answers (2)

SpghttCd

Reputation: 10890

My approach would be

import re

words = ['boro1', 'samcl2', 'hcln3', 'boro1+unwantedstuff']

p = r'(bo|h|fr|sam|pe)(cl(x|n|r|nr|eaner|)|ro)\d$'

for w in words:
    print(re.match(p, w))

Result:

<_sre.SRE_Match object; span=(0, 5), match='boro1'>
<_sre.SRE_Match object; span=(0, 6), match='samcl2'>
<_sre.SRE_Match object; span=(0, 5), match='hcln3'>
None

For your desired boolean output you can simply pass the match object to bool(): a match object is truthy, and None is falsy.
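
A minimal sketch of the boolean conversion, reusing the pattern above. The helper name `is_match` and the two extra test strings (`hclnr2`, `hcleaner9`, taken from the question) are only for illustration:

```python
import re

# Same pattern as in the answer above.
p = r'(bo|h|fr|sam|pe)(cl(x|n|r|nr|eaner|)|ro)\d$'

def is_match(item):
    # re.match returns a match object or None; bool() maps that to True/False.
    return bool(re.match(p, item))

for w in ['boro1', 'samcl2', 'hclnr2', 'hcleaner9', 'boro1+unwantedstuff']:
    print(w, is_match(w))
```

This also removes the need for the loop over prefixes and the second loop for `ro`, since both are folded into the one alternation.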

Upvotes: 2

Jerry

Reputation: 71598

One misconception in your code is the use of a character class (syntax: [ ... ]). A character class matches exactly one character out of those listed between the brackets. Almost everything inside it is taken literally, including |; only a few characters, such as ^ and - in specific positions, have a special meaning there. This means that:

[x|n|r|nr|eaner]

will match any one character among: x, |, n, r, e, a (duplicate characters are simply ignored)
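
A quick way to check this yourself, using only the standard re module (the variable name `char_class` is just for illustration):

```python
import re

char_class = r'[x|n|r|nr|eaner]'

# Every distinct character inside the brackets matches by itself, including '|'.
for ch in ['x', '|', 'n', 'r', 'e', 'a']:
    print(repr(ch), bool(re.fullmatch(char_class, ch)))

# The class consumes exactly one character, so a two-character string cannot fully match.
print(bool(re.fullmatch(char_class, 'nr')))

# This is why the question's pattern fails on "hclnr2" but works on "hcln2":
# after the class matches 'n', the next character must be a digit.
print(bool(re.search(r'hcl[x|n|r|nr|eaner]\d', 'hclnr2')))
print(bool(re.search(r'hcl[x|n|r|nr|eaner]\d', 'hcln2')))
```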

I'm not entirely sure why you are doing intricate things like re.escape in your code, but I trust you can adapt the snippet below to your situation:

import re

def matchPattern(item, extract=False):
    result = re.match(r"(bo|h|fr|sam|pe)((?:cl(?:nr|eaner|[xnr]|))|ro)([0-9])$", item)
    if result:
        if extract:
            return (result.group(1), result.group(2), result.group(3))
        else:
            return True
    else:
        if extract:
            return ('','','')
        else:
            return False

I wrote the function so that a plain call such as matchPattern("boro1") returns a boolean, while matchPattern("boro1", True) returns the substring components, here ('bo', 'ro', '1'), or ('', '', '') if the string doesn't match.

As for the regex itself, you can test it here (regex101.com)

The | operator has the lowest precedence in a regex, so you need groups to limit which parts of the pattern it alternates between. In the regex I use above,

  • (bo|h|fr|sam|pe) means either one of bo, h, fr, sam or pe
  • ((?:cl(?:nr|eaner|[xnr]|))|ro) means either (?:cl(?:nr|eaner|[xnr]|)) (this means cl followed by either nr, eaner, x, n, r or nothing) or ro
  • ([0-9]) means a single ASCII digit (I prefer this to \d for minor additional performance; note that on Python 3 str patterns \d also matches non-ASCII Unicode digits)
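
If the prefixes and suffixes have to stay in lists, as in the question, the same pattern can be assembled from them. This is only a sketch under that assumption: the names `PREFIXES`, `CL_SUFFIXES`, `build_pattern` and `matches` are made up for illustration, and re.escape is kept only as a precaution in case a list entry ever contains a regex metacharacter.

```python
import re

PREFIXES = ['bo', 'h', 'fr', 'sam', 'pe']
CL_SUFFIXES = ['x', 'n', 'r', 'nr', 'eaner']

def build_pattern(prefixes, cl_suffixes):
    # Escape each entry so any regex metacharacters in the lists are taken literally.
    prefix_alt = '|'.join(re.escape(p) for p in prefixes)
    suffix_alt = '|'.join(re.escape(s) for s in cl_suffixes)
    # A prefix, then "cl" plus an optional suffix (or "ro"), then exactly one ASCII digit.
    return re.compile(rf'(?:{prefix_alt})(?:cl(?:{suffix_alt})?|ro)[0-9]')

PATTERN = build_pattern(PREFIXES, CL_SUFFIXES)

def matches(item):
    # fullmatch requires the whole string to fit, so no explicit ^ or $ is needed.
    return PATTERN.fullmatch(item) is not None

for item in ['boro1', 'samcl2', 'hcln3', 'hclnr2', 'hcleaner9', 'hclq1']:
    print(item, matches(item))
```

Compiling once and reusing `PATTERN` avoids rebuilding the regex for every item, and a single alternation replaces the per-prefix loop.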

Upvotes: 2
