Reputation: 371
--SOLVED-- I solved my issue by enabling multiline mode, and now the characters ^ and $ work perfectly for identifying the beginning and end of each string
--EDIT--
My code:
import re
import test_regex
def regex_content(text_content, regex_dictionary):
#text_content = text_content.lower()
regex_matches = []
# Search sanitized text (markup removed) for DLP theme keywords
for key,value in regex_dictionary.items():
# Get confiiguration settings
min_matches = value.get('min_matches',1)
risk = value.get('risk',1)
enabled = value.get('enabled',False)
regex_str = value.get('regex','')
# Fast compute True/False hit for each DLP theme word
if enabled:
print "Searching for key : %s" % (key)
my_regex = re.compile(value.get('regex'))
hits = my_regex.findall(text_content)
if len(hits) > 0:
regex_matches.append((key, risk, len(hits), hits))
# Return array of results (key, risk, number of hits, regex matches)
return regex_matches
def main():
#print defaults.test_regex.dlp_regex
text_content = ""
for line in open('testData.txt'):
text_content+=line
for match in regex_content(text_content, test_regex.dlp_regex):
print "\nFound %s : %s" % (match[0], match[3])
print "\n"
if __name__ == '__main__':
main()
and it is using the regex found here:
'Large number of US Zip Codes' : { 'regex' : "\b\d{5}(?:-\d{1,4})?\b"},
When I precede my regex with the 'r' flag, I can find the zip codes I'm looking for, but as well as every other 5 digit number in my document I am searching through. From my understanding this is because it ignored the \b characters. Without the r flag though, it cannot find any zip codes. It works perfectly fine in regexr, but not in my code. I haven't had any luck making \b characters work, nor ^ and $ for identifying the beginnings and ends of the strings I'm searching for. What is it that I am misunderstanding about these special characters?
--Original post--
I am writing a regex for identifying zip codes (and only zip codes), so to avoid false positives I am trying to include a boundary on my regex, using both of the following:
\b\d{5}\b|\b\d{5}-\b\d{1,4}\b
using the online regex debugger Regexr, my code should correctly catch 5 digit zip codes, such as 34332. However, I have two problems:
1. This regex is not working in my actual code for finding any zip codes, but it does work when I don't have the boundary (\b) characters. The exact code I'm trying to extract with my regex is:
Zip:
----
98839-0111
34332
2. I don't see why my regex can't correctly identify 98839-0111 in Regexr. I tried doing the super-primitive approach of
\b\d{5}\b|98839-0111
and even that couldn't identify 98839-0111. Does anyone know what could be going on?
Note: I have also tried using ^ and $ for the boundaries of my regex, but this also doesn't find the regex's, not even in Regexr.
EDIT: After removing the first part of my regex, leaving only
98839-0111
It can now correctly identify it. I guess this means that once a string is pulled out by one of my regex's, it can no longer be found by any subsequent regexs? Why is this?
Upvotes: 4
Views: 2479
Reputation: 626920
It is because of the alternative list: the first part was matched, and the engine stopped checking.
Try this regex
98839-0111|\b\d{5}\b
And you'll get a match.
Or, to be more generic in your case:
\b(?:\d{5}-\d{4}|\d{5})\b
will match both, and more (actually, functionally the same as \b\d{5}(?:-\d{4})?\b
). See demo.
Upvotes: 6
Reputation: 89557
Your pattern is evaluated for each position in the string from the left to the right, so if the left branch of your pattern succeeds, the second branch isn't tested at all.
I suggest you to use this pattern that solves the problem:
\b\d{5}(?:-\d{1,4})?\b
Upvotes: 0