VDedo

Reputation: 11

Joined regexes in python sometimes fail

I'm using regexes to validate that some data is in the correct format, but the validation sometimes fails.

I have data entries with multiple fields. Only a few fields are consistent enough in format, or never blank, to be validated, so I created a small regex for each of those. I want to join the fields and the regexes together and call match() once per entry. Although all the individual fields match their respective regexes, the joined check fails.

import re
from sys import stdin

regs = [r'\d.\d', r'[a-z][a-z]', r'\d{1,2}']  # one regex per validated field
indeces = [4, 15, 77]  # column indices of the validated fields
for line in stdin:

    linesplit = line.rstrip().split('\t')
    testline = []
    for index in indeces:
        testline.append(linesplit[index])

    # check each selected field against its own regex
    for i, e in enumerate(testline):
        print i, e, re.match(regs[i], e)

This works fine, but when I try the last bit, it sometimes fails.

fullregex = re.compile('\t'.join(regs))
f = fullregex.match('\t'.join(testline).rstrip())
print 'F', f, '!'

Is there an issue with my fullregex or are the matches I'm getting just false positives?

Upvotes: 1

Views: 105

Answers (1)

Martijn Pieters

Reputation: 1123510

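Your individual regular expressions only have to match at the start of a column, because re.match() does not require the match to reach the end of the string. The joined expression is stricter: it expects a tab immediately after each sub-expression, so any extra text left in a column makes it fail.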

For example, the input:

['1.1 bar', 'hello', '123 is a number']

matches your individual expressions, but when joined with tabs in between no longer matches the fullregex pattern.
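To make the mismatch visible, here is a minimal sketch (written in Python 3 syntax, unlike the Python 2 code in this thread) that prints what each expression consumes and what it leaves behind. The `leftovers` list is only an illustrative name, not something from the question.

```python
import re

regs = [r'\d.\d', r'[a-z][a-z]', r'\d{1,2}']
testline = ['1.1 bar', 'hello', '123 is a number']

# re.match() anchors only at the start of the string, so each
# expression may stop well before the end of its column.
leftovers = []
for expr, col in zip(regs, testline):
    m = re.match(expr, col)
    leftovers.append(col[m.end():])
    print(repr(m.group()), 'leaves', repr(col[m.end():]))

# The joined pattern wants a tab right after '1.1', but the joined
# string has ' bar' there, so the overall match fails.
joined = re.match('\t'.join(regs), '\t'.join(testline))
print(joined)
```

Any non-empty leftover is text that the plain tab-joined pattern cannot skip over.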

Add [^\t]* as 'padding' on both sides of the tab in the joining string to allow for extra non-tab characters around the tabs:

fullregex = re.compile(r'[^\t]*\t[^\t]*'.join(regs))

Demo:

>>> import re
>>> regs = ['\d.\d','[a-z][a-z]','\d{1,2}']
>>> testline = ['1.1 bar', 'hello', '123 is a number']
>>> for i, (expr, line) in enumerate(zip(regs, testline)):
...     print i, line, re.match(expr, line)
... 
0 1.1 bar <_sre.SRE_Match object at 0x101c17e68>
1 hello <_sre.SRE_Match object at 0x101c17e68>
2 123 is a number <_sre.SRE_Match object at 0x101c17e68>
>>> re.match('\t'.join(regs), '\t'.join(testline))
>>> re.match(r'[^\t]*\t[^\t]*'.join(regs), '\t'.join(testline))
<_sre.SRE_Match object at 0x101c17e68>
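One thing worth checking before relying on the padded pattern: the padding sits on both sides of each tab, so the second and later expressions may match anywhere inside their column, which is looser than the per-column re.match() calls. The sketch below (Python 3 syntax; the sample row is made up for illustration) shows a row that the padded pattern accepts even though its later columns fail the individual checks.

```python
import re

regs = [r'\d.\d', r'[a-z][a-z]', r'\d{1,2}']
padded = re.compile(r'[^\t]*\t[^\t]*'.join(regs))

# Hypothetical row: columns 2 and 3 do not start with a match.
columns = ['1.1', '99he', 'abc12']

# Per-column check, as in the question: True where re.match() succeeds.
per_column = [re.match(expr, col) is not None
              for expr, col in zip(regs, columns)]
print(per_column)

# The padded join still matches, because [^\t]* after each tab can
# absorb the leading '99' and 'abc'.
padded_ok = padded.match('\t'.join(columns)) is not None
print(padded_ok)
```

Whether that extra leniency is acceptable depends on how strict the validation needs to be.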

If you wanted your regular expressions to validate the whole column, then either:

  1. test each column with zip() and anchors (^ and $):

    all(re.match('^{}$'.format(expr), col) for expr, col in zip(regs, testline))
    

    where all() only returns True if all expressions matched and the anchors make sure the expression covers the whole column, not just a subset.

  2. or join the columns with \t, but do add anchors as well:

    re.match('^{}$'.format('\t'.join(regs)), '\t'.join(testline))
    

Demo:

>>> regs = ['\d.\d','[a-z][a-z]','\d{1,2}']
>>> testline = ['1.1', 'he', '12']
>>> all(re.match('^{}$'.format(expr), col) for expr, col in zip(regs, testline))
True
>>> re.match('^{}$'.format('\t'.join(regs)), '\t'.join(testline))
<_sre.SRE_Match object at 0x101caa780>
>>> testline = ['1.1 bar', 'hello', '123 is a number']
>>> all(re.match('^{}$'.format(expr), col) for expr, col in zip(regs, testline))
False
>>> re.match('^{}$'.format('\t'.join(regs)), '\t'.join(testline)) is None
True
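On Python 3.4 and later, re.fullmatch() gives the same whole-column check without adding ^ and $ by hand; it is not available in the Python 2 interpreter used in the demos above. The sketch below assumes Python 3, and `failing_columns` is a hypothetical helper name introduced here to report which columns did not validate.

```python
import re

regs = [r'\d.\d', r'[a-z][a-z]', r'\d{1,2}']


def failing_columns(regs, columns):
    """Return the positions in `columns` whose text is not fully matched."""
    return [i for i, (expr, col) in enumerate(zip(regs, columns))
            if re.fullmatch(expr, col) is None]


good = failing_columns(regs, ['1.1', 'he', '12'])
bad = failing_columns(regs, ['1.1 bar', 'he', '123 is a number'])
print(good)
print(bad)
```

Knowing which position failed can be easier to debug than a single True/False for the whole row.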

Upvotes: 1
