Regex matching a string which is a semicolon-separated sequence of tokens

Question

I am trying to produce a Python regex string to validate the values of a column, which are semicolon-separated sequences of unique three-character codes drawn from a list of three-character (upper-cased) alphanumeric codes, e.g. the list looks something like ['XA1', 'CZZ', 'BT9', 'WFF',...]. So valid column values could be XA1, XA1;CZZ, or XA1;BT9;WFF; etc. A code cannot occur in the sequence more than once.

Valid sequences must be non-empty, consist of unique codes, and may or may not terminate with a ;, including the case where a sequence only contains one code.

If codes is the list of codes then the regex matching string I constructed from this is

match_str = '?'.join(['({};){}'.format(code, '?' if codes[-1] == code else '') for code in codes])

which gives me, using that example list with only four codes above

'(XA1;)?(CZZ;)?(BT9;)?(WFF;)?'

The regex match queries do produce non-null match objects for what should be valid sequences, e.g.

re.match(match_str, 'XA1;')
re.match(match_str, 'XA1;WFF')
re.match(match_str, 'XA1;')

etc. However, they also produce non-null match objects for strings that should be invalid:

In [124]: re.match(match_str, 'anystring')
Out[124]: <_sre.SRE_Match object; span=(0, 0), match=''>

In [125]: re.match(match_str, '')
Out[125]: <_sre.SRE_Match object; span=(0, 0), match=''>

In [126]: re.match(match_str, 'XA1;something')
Out[126]: <_sre.SRE_Match object; span=(0, 4), match='XA1;'>

I want the results of all three queries above to be null, so I can use a conditional to filter out invalid values, e.g.

if re.match(match_str, val):
     # do something
else:
     # do something else

Wiktor Stribiżew · Accepted Answer

Regular expressions are best avoided in a situation like yours, where you want to fail strings that contain duplicate chunks.
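As a side note on why the pattern in the question accepts 'anystring' and '': every group in it is optional, and re.match only anchors at the start of the string, so a zero-length match always succeeds. The sketch below reuses the four-code pattern from the question to illustrate this. It also shows that switching to re.fullmatch helps only partly, since the empty string still passes and the codes are forced into list order.

```python
import re

# Pattern copied from the question: every group is optional.
match_str = '(XA1;)?(CZZ;)?(BT9;)?(WFF;)?'

# re.match anchors only at the start, so an empty match always succeeds.
m = re.match(match_str, 'anystring')
print(m.span())  # (0, 0)

# re.fullmatch must consume the whole string, so these are rejected...
print(re.fullmatch(match_str, 'anystring'))      # None
print(re.fullmatch(match_str, 'XA1;something'))  # None

# ...but the empty string still passes, and codes must appear in list order.
print(re.fullmatch(match_str, '') is not None)   # True
print(re.fullmatch(match_str, 'CZZ;XA1;'))       # None
```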

Use "regular" Python:

codes = ['XA1', 'CZZ', 'BT9', 'WFF']
strs = ['XA1', 'XA1;CZZ', 'XA1;BT9;WFF;', 'XA1;XA1;', 'XA1;something']
for s in strs:
    # Drop leading/trailing ';' and split the rest into chunks
    chunks = s.strip(';').split(';')
    # Valid only if every chunk is a known code and no chunk repeats
    if set(chunks).issubset(codes) and len(chunks) == len(set(chunks)):
        print("{}: Valid!".format(s))
    else:
        print("{}: Invalid!".format(s))

See Python demo online.

NOTES:

chunks = s.strip(';').split(';') - removes leading/trailing ; and splits the string with ;
if set(chunks).issubset(codes) and len(chunks) == len(set(chunks)): - checks if all the chunks we obtained are a subset of codes and makes sure each item inside chunks is unique.
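One caveat: s.strip(';') removes any number of semicolons from both ends, so inputs such as ';XA1' or 'XA1;;' would also be reported as valid. If that is too lenient, a stricter variant could drop at most one trailing ; instead. The following is only a sketch under that assumption; the helper name is_valid is hypothetical and not part of the code above.

```python
def is_valid(value, codes):
    """Return True if value is a non-empty ';'-separated sequence of unique codes."""
    # Drop at most one trailing ';' (str.strip(';') would drop any number, on both ends).
    if value.endswith(';'):
        value = value[:-1]
    chunks = value.split(';')
    allowed = set(codes)
    # Every chunk must be a known code, and no chunk may repeat.
    return len(chunks) == len(set(chunks)) and all(c in allowed for c in chunks)


codes = ['XA1', 'CZZ', 'BT9', 'WFF']
for s in ['XA1', 'XA1;CZZ', 'XA1;BT9;WFF;', 'XA1;XA1;', 'XA1;something', '', ';XA1', 'XA1;;']:
    print("{!r}: {}".format(s, "Valid!" if is_valid(s, codes) else "Invalid!"))
```

An empty string splits into a single empty chunk, which is not a known code, so it is rejected without a separate emptiness check.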

Regex solution - DON'T USE IN PRODUCTION!

import re
codes = ['XA1', 'CZZ', 'BT9', 'WFF']
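# Alternation of all allowed codes, e.g. (?:XA1|CZZ|BT9|WFF)
block = "(?:{})".format("|".join(codes))
# The negative lookahead rejects any repeated word; the rest matches code(;code)* with an optional trailing ;
rex = re.compile(r"^(?!.*\b(\w+)\b.*\b\1\b){0}(?:;{0})*;?$".format(block))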
print(rex)

strs = ['XA1', 'XA1;CZZ', 'XA1;BT9;WFF;', 'XA1;XA1;', 'XA1;something']
for s in strs:
    if rex.match(s):
        print("{}: Valid!".format(s))
    else:
        print("{}: Invalid!".format(s))

See the Python demo
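To make the pattern above easier to read, here is the same regex written with re.VERBOSE and one comment per part. It is a sketch rather than a different solution. The name rex_verbose is hypothetical, and the re.escape call is an added precaution (an assumption beyond the code above) in case a code ever contains a regex metacharacter.

```python
import re

codes = ['XA1', 'CZZ', 'BT9', 'WFF']
# re.escape is an added precaution; the sample codes are purely alphanumeric.
block = "(?:{})".format("|".join(map(re.escape, codes)))

rex_verbose = re.compile(r"""
    ^
    (?! .* \b(\w+)\b .* \b\1\b )   # fail if any whole word occurs twice
    {0}                            # first code (required)
    (?: ; {0} )*                   # zero or more further ';code' items
    ;?                             # optional trailing ';'
    $
""".format(block), re.VERBOSE)

for s in ['XA1', 'XA1;CZZ', 'XA1;BT9;WFF;', 'XA1;XA1;', 'XA1;something', '']:
    print("{!r}: {}".format(s, "Valid!" if rex_verbose.match(s) else "Invalid!"))
```

Note that $ also matches just before a trailing newline, so 'XA1\n' would pass. If that matters, \Z in place of $ (or re.fullmatch) is stricter.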
