srm

Reputation: 597

Regex matching a string which is a semicolon-separated sequence of tokens

I am trying to produce a Python regex string to validate the values of a column, which are semicolon-separated sequences of unique three-character codes drawn from a list of three-character (upper-cased) alphanumeric codes, e.g. the list looks something like ['XA1', 'CZZ', 'BT9', 'WFF', ...]. So valid column values could be XA1, XA1;CZZ, or XA1;BT9;WFF; etc. A code cannot occur in the sequence more than once.

Valid sequences must be non-empty, consist of unique codes, and may or may not terminate with a ;, including the case where a sequence only contains one code.

If codes is the list of codes then the regex matching string I constructed from this is

match_str = '?'.join(['({};){}'.format(code, '?' if codes[-1] == code else '') for code in codes])

which gives me, using that example list with only four codes above

'(XA1;)?(CZZ;)?(BT9;)?(WFF;)?'

The regex match queries do produce non-null match objects for what should be valid sequences, e.g.

re.match(match_str, 'XA1;')
re.match(match_str, 'XA1;WFF')
re.match(match_str, 'XA1;')

etc. However, they also produce non-null match objects for strings that should be invalid:

In [124]: re.match(match_str, 'anystring')                                                                                        
Out[124]: <_sre.SRE_Match object; span=(0, 0), match=''>

In [125]: re.match(match_str, '')                                                                                                 
Out[125]: <_sre.SRE_Match object; span=(0, 0), match=''>

In [126]: re.match(match_str, 'XA1;something')                                                                                    
Out[126]: <_sre.SRE_Match object; span=(0, 4), match='XA1;'>

I want the results of all three queries above to be null, so I can use a conditional to filter out invalid values, e.g.

if re.match(match_str, val):
     # do something
else:
     # do something else

Upvotes: 2

Views: 805

Answers (2)

srm

Reputation: 597

This is a general non-regex solution to the problem of checking whether a string is a ;-separated sequence of unique/non-repeating codes/tokens (from a fixed set of such tokens). The tokens in the string can be in any order, but only one occurrence of any token is permitted, and the string may or may not terminate with a ;. Each token in the string must also not be surrounded by any spaces.

Example: let the set of tokens be a collection or subset of two-letter country codes such as {'AR', 'CA', 'GB', 'HK', 'IN', 'US'}. Then "valid" strings in the context of this problem are those such as AR;CA, HK;US;CA;GB;, US, HK;, and invalid strings are those such as ' AR;CA' (note the leading space), AR;something;HK;, something;AR;GB;US etc.

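import collections

codes = {'AR', 'CA', 'GB', 'HK', 'IN', 'US'}

def is_valid_token_sequence(s, tokens, sep=';'):
    # Split on the separator, dropping empty chunks (e.g. from a trailing ';')
    s_tokens = [t for t in s.split(sep) if t]
    # Occurrence count of each token found in the string
    token_cntr = collections.Counter(s_tokens).values()
    return not (
        any(t not in tokens for t in s_tokens) or
        any(v > 1 for v in token_cntr)
    )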

>>> is_valid_token_sequence('AR;CA', codes)
True

>>> is_valid_token_sequence('HK;US;CA;GB;', codes)
True

>>> is_valid_token_sequence('IN', codes)
True

>>> is_valid_token_sequence('HK;', codes)
True

>>> is_valid_token_sequence(' AR;CA', codes)
False

>>> is_valid_token_sequence('1234;AR;X1;IN;CA', codes)
False

>>> is_valid_token_sequence('X1;AR;GB;US', codes)
False
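
Note that the function above also returns True for the empty string and for strings with empty chunks such as 'AR;;CA', because empty chunks are dropped before the checks run. If those should be rejected (the question asks for non-empty sequences), a stricter variant could look like the sketch below. The name is_strict_token_sequence is hypothetical, and allowing exactly one optional trailing separator is an assumption about the intended rules. It also assumes the empty string is not itself a token.

```python
def is_strict_token_sequence(s, tokens, sep=';'):
    """Return True only for a non-empty sequence of unique, known tokens."""
    # Assumption: at most one trailing separator is allowed ('AR;' yes, 'AR;;' no).
    if s.endswith(sep):
        s = s[:-len(sep)]
    parts = s.split(sep)
    # An empty input or an empty chunk yields '' in parts, which is not a token,
    # so the membership test fails for '', ';AR' and 'AR;;CA'.
    return all(p in tokens for p in parts) and len(parts) == len(set(parts))


codes = {'AR', 'CA', 'GB', 'HK', 'IN', 'US'}
print(is_strict_token_sequence('HK;US;CA;GB;', codes))  # True
print(is_strict_token_sequence('', codes))              # False
print(is_strict_token_sequence('AR;;CA', codes))        # False
print(is_strict_token_sequence('AR;AR', codes))         # False
```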

Upvotes: 0

Wiktor Stribiżew

Reputation: 626903

Regular expressions should be avoided in a situation like yours, where you want to reject strings that contain duplicate chunks.

Use "regular" Python:

codes = ['XA1', 'CZZ', 'BT9', 'WFF']
strs = ['XA1', 'XA1;CZZ', 'XA1;BT9;WFF;', 'XA1;XA1;', 'XA1;something']
for s in strs:
    chunks = s.strip(';').split(';')
    if set(chunks).issubset(codes) and len(chunks) == len(set(chunks)):
        print("{}: Valid!".format(s))
    else:
        print("{}: Invalid!".format(s))


NOTES:

  • chunks = s.strip(';').split(';') - removes leading/trailing ; and splits the string with ;
  • if set(chunks).issubset(codes) and len(chunks) == len(set(chunks)): - checks if all the chunks we obtained are a subset of codes and makes sure each item inside chunks is unique.
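
To use this as a filter condition, the same two checks can be wrapped in a predicate. This is only a sketch: the name is_valid_sequence is hypothetical, and storing codes in a set is an optional choice rather than part of the answer above. Because strip(';') removes every leading and trailing semicolon, inputs such as ';XA1' or 'XA1;;' are accepted by this approach as well.

```python
def is_valid_sequence(s, codes):
    # Drop leading/trailing ';' and split into chunks, as in the loop above.
    chunks = s.strip(';').split(';')
    unique = set(chunks)
    # Every chunk must be a known code, and no chunk may repeat.
    return unique.issubset(codes) and len(chunks) == len(unique)


codes = {'XA1', 'CZZ', 'BT9', 'WFF'}
values = ['XA1', 'XA1;CZZ', 'XA1;BT9;WFF;', 'XA1;XA1;', 'XA1;something', '']
valid = [v for v in values if is_valid_sequence(v, codes)]
print(valid)  # ['XA1', 'XA1;CZZ', 'XA1;BT9;WFF;']
```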

Regex solution - DON'T USE IN PRODUCTION!

import re
codes = ['XA1', 'CZZ', 'BT9', 'WFF']
block = "(?:{})".format("|".join(codes))
rex = re.compile(r"^(?!.*\b(\w+)\b.*\b\1\b){0}(?:;{0})*;?$".format(block))
print(rex)

strs = ['XA1', 'XA1;CZZ', 'XA1;BT9;WFF;', 'XA1;XA1;', 'XA1;something']
for s in strs:
    if rex.match(s):
        print("{}: Valid!".format(s))
    else:
        print("{}: Invalid!".format(s))
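
For readers who want to see how the pattern above is put together, here is an annotated sketch of the same idea. The helper name build_pattern is hypothetical, and the use of re.escape, re.VERBOSE and fullmatch (instead of the ^ and $ anchors) is a variation that is not part of the original answer.

```python
import re


def build_pattern(codes):
    # Alternation of all allowed codes; re.escape guards against special characters.
    block = "(?:{})".format("|".join(re.escape(c) for c in codes))
    return re.compile(r"""
        (?!.*\b(\w+)\b.*\b\1\b)   # negative lookahead: fail if any word occurs twice
        {0}                       # first code is required, so an empty string fails
        (?:;{0})*                 # zero or more further ';CODE' chunks
        ;?                        # optional trailing semicolon
    """.format(block), re.VERBOSE)


pattern = build_pattern(['XA1', 'CZZ', 'BT9', 'WFF'])
strs = ['XA1', 'XA1;CZZ', 'XA1;BT9;WFF;', 'XA1;XA1;', 'XA1;something']
results = [bool(pattern.fullmatch(s)) for s in strs]
print(results)  # [True, True, True, False, False]
```

Using fullmatch requires the whole string to match, so a value with a trailing newline such as 'XA1\n' is rejected, whereas a pattern ending in $ used with match would accept it.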


Upvotes: 1
