Reputation: 11
I'm using regexes to validate that some data is in the correct format, and the validation sometimes fails.
I have data entries with multiple fields. Only a few fields are consistent enough in format or never blank to be validated, so I created a small regex to match each of those. I want to join the fields and the regexes together and call match() once per entry. Although all the individual fields match their respective regexes, the joined check fails.
import re
from sys import stdin

regs = [r'\d.\d', r'[a-z][a-z]', r'\d{1,2}']  # list of regexes, one per checked column
indices = [4, 15, 77]  # list of column indices to check
for line in stdin:
    linesplit = line.rstrip().split('\t')
    testline = []
    for index in indices:
        testline.append(linesplit[index])
    # check each selected column against its own regex
    for i, e in enumerate(testline):
        print i, e, re.match(regs[i], e)
This works fine, but when I try the last bit, it sometimes fails.
fullregex = re.compile('\t'.join(regs))
f = fullregex.match('\t'.join(testline).rstrip())
print 'F', f, '!'
Is there an issue with my fullregex or are the matches I'm getting just false positives?
Upvotes: 1
Views: 105
Reputation: 1123510
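Your individual regular expressions only have to match at the start of each column, because re.match() does not require the match to reach the end of the string. The joined expression is stricter: each tab has to follow directly after the text matched by the preceding expression.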
For example, the input:
['1.1 bar', 'hello', '123 is a number']
matches your individual expressions, but when joined with tabs in between no longer matches the fullregex pattern.
Add [^\t]* as 'padding' into your expressions to allow for extra non-tab characters around the tabs:
fullregex = re.compile(r'[^\t]*\t[^\t]*'.join(regs))
Demo:
>>> import re
>>> regs = ['\d.\d','[a-z][a-z]','\d{1,2}']
>>> testline = ['1.1 bar', 'hello', '123 is a number']
>>> for i, (expr, line) in enumerate(zip(regs, testline)):
... print i, line, re.match(expr, line)
...
0 1.1 bar <_sre.SRE_Match object at 0x101c17e68>
1 hello <_sre.SRE_Match object at 0x101c17e68>
2 123 is a number <_sre.SRE_Match object at 0x101c17e68>
>>> re.match('\t'.join(regs), '\t'.join(testline))
>>> re.match(r'[^\t]*\t[^\t]*'.join(regs), '\t'.join(testline))
<_sre.SRE_Match object at 0x101c17e68>
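One caveat with this padding, sketched here in Python 3 syntax: the [^\t]* that follows each tab also lets the next expression start anywhere inside its column, which is looser than a per-column re.match(). If the joined pattern should mirror the per-column checks, padding only before the tab appears to be enough. The names loose, strict, row and shifted are illustrative and not from the question.

```python
import re

regs = [r'\d.\d', r'[a-z][a-z]', r'\d{1,2}']

# Padding on both sides of the tab: later expressions may start mid-column.
loose = re.compile(r'[^\t]*\t[^\t]*'.join(regs))
# Padding only before the tab: each expression must start its column,
# which mirrors calling re.match() on every column separately.
strict = re.compile(r'[^\t]*\t'.join(regs))

row = '\t'.join(['1.1 bar', 'hello', '123 is a number'])
# Hypothetical row whose last column does not start with a digit.
shifted = '\t'.join(['1.1 bar', 'hello', 'no 12 here'])

print(bool(loose.match(row)), bool(strict.match(row)))          # True True
print(bool(loose.match(shifted)), bool(strict.match(shifted)))  # True False
```

The last column of shifted would fail an individual re.match(r'\d{1,2}', ...) check, so only the strict variant agrees with the per-column results here.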
If you wanted your regular expressions to validate the whole column, then either:
test each column with zip() and anchors (^ and $):
all(re.match('^{}$'.format(expr), col) for expr, col in zip(regs, testline))
where all() only returns True if all expressions matched and the anchors make sure the expression covers the whole column, not just a subset.
or join the columns with \t, but do add anchors as well:
re.match('^{}$'.format('\t'.join(regs)), '\t'.join(testline))
Demo:
>>> regs = ['\d.\d','[a-z][a-z]','\d{1,2}']
>>> testline = ['1.1', 'he', '12']
>>> all(re.match('^{}$'.format(expr), col) for expr, col in zip(regs, testline))
True
>>> re.match('^{}$'.format('\t'.join(regs)), '\t'.join(testline))
<_sre.SRE_Match object at 0x101caa780>
>>> testline = ['1.1 bar', 'hello', '123 is a number']
>>> all(re.match('^{}$'.format(expr), col) for expr, col in zip(regs, testline))
False
>>> re.match('^{}$'.format('\t'.join(regs)), '\t'.join(testline)) is None
True
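On Python 3.4 or later, re.fullmatch() can stand in for the explicit ^ and $ anchors. A minimal sketch, assuming the same regs as above; the helper name columns_valid is hypothetical, and wrapping each expression in a non-capturing group (?:...) is a precaution I've added so that a top-level | in one expression cannot spill across the join.

```python
import re

regs = [r'\d.\d', r'[a-z][a-z]', r'\d{1,2}']

# Compile once; (?:...) keeps any top-level alternation inside its own column.
full_pattern = re.compile('\t'.join('(?:{})'.format(expr) for expr in regs))

def columns_valid(columns):
    """Return True only if the tab-joined columns are matched in full."""
    return full_pattern.fullmatch('\t'.join(columns)) is not None

print(columns_valid(['1.1', 'he', '12']))                        # True
print(columns_valid(['1.1 bar', 'hello', '123 is a number']))    # False
```

Unlike $, which also matches just before a trailing newline, fullmatch() requires the pattern to consume the entire string.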
Upvotes: 1