Dave

Reputation: 3956

Case statement matching many regular expressions

Python finally got assignment expressions in version 3.8. But the discussion included:

During the development of this PEP many people (supporters and critics both) have had a tendency to focus on toy examples on the one hand, and on overly complex examples on the other.

The danger of toy examples is twofold: they are often too abstract to make anyone go "ooh, that's compelling", and they are easily refuted with "I would never write it that way anyway".

So here is a real example: I'm writing a DSL parser, I want it to run on Python versions other than the latest, and I can't figure out a Pythonic way to do it.

So if you would "never write it (the bottom code sample) that way", how do you write a case statement in Python 3.7 without using infinite nesting levels?

import re

doc = """
 123 45 6789
   red sky at night
     abc42
       [foo, bar+, $)&%(@]
 ContentType: image/jpeg
"""

pat1 = r'^\s*(\w+)\s*$'
pat2 = r'^\s*(\d+(?:\s+\d+)*)\s*$'
pat3 = r'^\s*(\w+):\s*(\w+\/\w+)\s*$'
pat4 = r'^\s*\[([^]]+)\]\s*$'

# Python 3.8+
for line in doc.splitlines():
    if m := re.match(pat1, line):
        print(f'Type 1: |{m.group(1)}|')
    elif m := re.match(pat2, line):
        print(f'Type 2: |{m.group(1)}|')
    elif m := re.match(pat3, line):
        print(f'Type 3: |{m.group(1)}|{m.group(2)}|')
    elif m := re.match(pat4, line):
        print(f'Type 4: |{m.group(1)}|')
    elif line:
        print(f'Unknown Format: |{line}|')

print('=============')

# Python 3.x <3.8
for line in doc.splitlines():
    m = re.match(pat1, line)
    if m:
        print(f'Type 1: |{m.group(1)}|')
    else:
        m = re.match(pat2, line)
        if m:
            print(f'Type 2: |{m.group(1)}|')
        else:
            m = re.match(pat3, line)
            if m:
                print(f'Type 3: |{m.group(1)}|{m.group(2)}|')
            else:
                m = re.match(pat4, line)
                if m:
                    print(f'Type 4: |{m.group(1)}|')
                elif line:
                    print(f'Unknown Format: |{line}|')

The output doesn't matter; it's a toy example to illustrate a real problem. For the record, running the script under Python 3.7 raises a SyntaxError, because the first loop uses :=.

Running under Python 3.8 produces:

Type 2: |123 45 6789|
Unknown Format: |   red sky at night|
Type 1: |abc42|
Type 4: |foo, bar+, $)&%(@|
Type 3: |ContentType|image/jpeg|
=============
Type 2: |123 45 6789|
Unknown Format: |   red sky at night|
Type 1: |abc42|
Type 4: |foo, bar+, $)&%(@|
Type 3: |ContentType|image/jpeg|

EDIT:

khelwood's approach is the most straightforward. It is easy to understand at a glance, more so than looping over patterns or dispatching.

It's still way uglier than the Python 3.8 version. I have no idea why anyone would be against assignment expressions or why Python took so long to get them.

# Python 3.7
def process_line(ln):
    m = re.match(pat1, ln)
    if m:
        print(f'Type 1: |{m.group(1)}|')
        return
    m = re.match(pat2, ln)
    if m:
        print(f'Type 2: |{m.group(1)}|')
        return
    m = re.match(pat3, ln)
    if m:
        print(f'Type 3: |{m.group(1)}|{m.group(2)}|')
        return
    m = re.match(pat4, ln)
    if m:
        print(f'Type 4: |{m.group(1)}|')
        return
    print(f'Unknown Format: |{ln}|')

for line in doc.splitlines():
    if line:
        process_line(line)

EDIT(orial) #2:

Now I know why it took so long for Python to implement such a simple and useful idea: PEP 572 Controversy. Disgust with the whole fiasco caused Python's creator to step down permanently, and is a cautionary tale on the perils of design by committee. Shame on those responsible for this loss. </editorial>

Upvotes: 0

Views: 347

Answers (2)

Alain T.

Reputation: 42129

To simulate a switch-statement, I use a one-line helper function:

def switch(v): yield lambda *c:v in c

It can be used with if/elif/else statements:

x = 1
for case in switch(x):
    if   case(1):   doSomething()
    elif case(2,3): doSomethingElse()
    elif case(4):   doAnotherThing()
    else:           handleOtherCases()

Or in a more C-like switch structure:

x = 1
for case in switch(x):

    if case(1):
        # do something
        break

    if case(2,3):
        # do something else
        break
else:
    pass  # handle other cases...
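
As a rough, runnable sketch of the C-like form above: the helper is a generator that yields a single predicate, so the for body runs exactly once, and the for/else branch fires only when no break was hit. The describe function and its return strings are made up for illustration.

```python
def switch(v):
    # Yields exactly one predicate, so the enclosing for loop runs once
    yield lambda *c: v in c

def describe(x):
    # Hypothetical example function, not part of the helper itself
    for case in switch(x):
        if case(1):
            result = "one"
            break
        if case(2, 3):
            result = "two or three"
            break
    else:
        # for/else: reached only when the loop ended without a break
        result = "other"
    return result

print(describe(1))
print(describe(3))
print(describe(7))
```

Because the loop only ever iterates once, break behaves like the break in a C switch, and the else clause plays the role of default.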

Using an improved version of this that accepts an optional matching function, you could make that Python 3.7 code more palatable:

def switch(v, match=None):
    if not match: yield lambda *c:v in c; return
    last = None
    def case(*c):
        nonlocal last
        if c: last = next((r for r in (match(p,v) for p in c) if r),None)
        return last
    yield case


# with a match function, case() without parameter returns the last match value
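
To make the remembered value concrete, here is a small sketch (the test string and patterns are arbitrary) showing what case(...) and case() return. One thing to keep in mind with this design is that a failed case(...) call also overwrites the remembered value with None.

```python
import re

def switch(v, match=None):
    if not match:
        yield lambda *c: v in c
        return
    last = None
    def case(*c):
        nonlocal last
        if c:
            # Try each pattern in order; keep the first truthy result (or None)
            last = next((r for r in (match(p, v) for p in c) if r), None)
        return last
    yield case

for case in switch("abc42", re.match):
    hit = case(r'^(\w+)$')     # matches; the match object is remembered
    again = case()             # no arguments: returns the remembered match
    miss = case(r'^\d+$')      # no match; the remembered value becomes None
    after_miss = case()        # so this is None as well

print(hit.group(1), again is hit, miss, after_miss)
```

So case() is only safe to call inside the branch whose case(...) condition just succeeded, which is how the if/elif chains use it.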

The Python 3.7 code could then be written like this:

for line in doc.splitlines():

    for case in switch(line, re.match):
        if   case(pat1): print(f'Type 1: |{case().group(1)}|')
        elif case(pat2): print(f'Type 2: |{case().group(1)}|')
        elif case(pat3): print(f'Type 3: |{case().group(1)}|{case().group(2)}|')
        elif case(pat4): print(f'Type 4: |{case().group(1)}|')
        elif line:       print(f'Unknown Format: |{line}|')

Or in a more C-like style (likely when more than a mere print call is made in each case):

for line in doc.splitlines():

    for case in switch(line, re.match):
        if case(pat1): 
            print(f'Type 1: |{case().group(1)}|')
            break

        if case(pat2): 
            print(f'Type 2: |{case().group(1)}|')
            break

        if case(pat3): 
            print(f'Type 3: |{case().group(1)}|{case().group(2)}|')
            break

        if case(pat4): 
            print(f'Type 4: |{case().group(1)}|')
            break

        if line:
            print(f'Unknown Format: |{line}|')

Note that the break statements here only exit the for case in switch(...) loop; they don't break out of the outer for line in ... loop.
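
A minimal sketch of that scoping, using the one-line switch and a made-up list of items: every item is still visited by the outer loop even though the inner loop breaks on the first two.

```python
def switch(v):
    yield lambda *c: v in c

processed = []
for item in ["a", "b", "c"]:          # outer loop: not affected by break below
    for case in switch(item):
        if case("a", "b"):
            processed.append(item.upper())
            break                      # leaves only the inner for case loop
        processed.append(item)         # reached when no case matched

print(processed)
```

If you do need to stop processing lines altogether, one option is to wrap the outer loop in a function and use return.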

With Python 3.8, you could make the switch function simpler and assign the match results in the conditions if you need them:

def switch(v,match=None):
    if not match: 
        yield lambda *c: v in c
    else:
        yield lambda *c: next((r for r in (match(p,v) for p in c) if r),None)

...

for line in doc.splitlines():

    for case in switch(line, re.match):
        if   m := case(pat1): print(f'Type 1: |{m.group(1)}|')
        elif m := case(pat2): print(f'Type 2: |{m.group(1)}|')
        elif m := case(pat3): print(f'Type 3: |{m.group(1)}|{m.group(2)}|')
        elif m := case(pat4): print(f'Type 4: |{m.group(1)}|')
        elif line:            print(f'Unknown Format: |{line}|')

In both versions you can specify multiple patterns on case conditions (the first match is returned):

for line in doc.splitlines():

    for case in switch(line, re.match):
        if not line: break

        if case(pat1,pat2): 
            print(f'Type 1 or 2: |{case().group(1)}|')
            break

        if case(pat3): 
            print(f'Type 3: |{case().group(1)}|{case().group(2)}|')
            break

        if case(pat4): 
            print(f'Type 4: |{case().group(1)}|')
            break
    else:            
        print(f'Unknown Format: |{line}|')

# output:

Type 1 or 2: |123 45 6789|
Unknown Format: |   red sky at night|
Type 1 or 2: |abc42|
Type 4: |foo, bar+, $)&%(@|
Type 3: |ContentType|image/jpeg|

Upvotes: 0

Roy2012

Reputation: 12543

Here's one way to do it, using a dictionary of patterns (dispatch table). I'm searching for the matching pattern, and then calling the function (lambda, in this case) associated with it.

import re
from collections import OrderedDict
doc = """123 45 6789
   red sky at night
     abc42
       [foo, bar+, $)&%(@]
 ContentType: image/jpeg
"""

pat0 = r'^\s*(\w+)\s*$'
pat1 = r'^\s*(\d+(?:\s+\d+)*)\s*$'
pat2 = r'^\s*(\w+):\s*(\w+\/\w+)\s*$'
pat3 = r'^\s*\[([^]]+)\]\s*$'

d = OrderedDict([
    (pat0, lambda m: print(f'Type 1: |{m.group(1)}|')), 
    (pat1, lambda m: print(f'Type 2: |{m.group(1)}|')), 
    (pat2, lambda m: print(f'Type 3: |{m.group(1)}|{m.group(2)}|')), 
    (pat3, lambda m: print(f'Type 4: |{m.group(1)}|'))    
])

for line in doc.splitlines():
    # look for the first pattern that matches the line 
    pattern = next((pat for pat in d.keys() if re.match(pat, line)), None)
    if pattern:
        m = re.match(pattern, line)
        d[pattern](m)
    else: 
        print(f'Unknown Format: |{line}|')
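
The loop above runs re.match twice for the winning pattern. A possible variant, sketched here with a plain list of (compiled pattern, handler) pairs instead of a dictionary, matches each pattern once and hands the match object straight to the handler. The names handlers and classify are my own, and the handlers return strings rather than printing so the result is easy to check.

```python
import re

# Hypothetical dispatch table: (compiled pattern, formatter) in priority order
handlers = [
    (re.compile(r'^\s*(\w+)\s*$'),
     lambda m: f'Type 1: |{m.group(1)}|'),
    (re.compile(r'^\s*(\d+(?:\s+\d+)*)\s*$'),
     lambda m: f'Type 2: |{m.group(1)}|'),
    (re.compile(r'^\s*(\w+):\s*(\w+\/\w+)\s*$'),
     lambda m: f'Type 3: |{m.group(1)}|{m.group(2)}|'),
    (re.compile(r'^\s*\[([^]]+)\]\s*$'),
     lambda m: f'Type 4: |{m.group(1)}|'),
]

def classify(line):
    # Return the formatted result of the first pattern that matches
    for pattern, handler in handlers:
        m = pattern.match(line)
        if m:
            return handler(m)
    return f'Unknown Format: |{line}|'

for line in [' 123 45 6789', '     abc42', '   red sky at night']:
    print(classify(line))
```

This runs on Python 3.7 as well, since it needs no assignment expressions.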

Upvotes: 1
