
using \b in regex

--SOLVED-- I solved my issue by enabling multiline mode, and now ^ and $ work perfectly for identifying the beginning and end of each line, which is where my strings start and end.
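For readers landing here, a minimal sketch of what multiline mode changes, assuming Python's re module and sample text modeled on the zip-code lines quoted further down (the variable names are only illustrative):

```python
import re

# Sample text modeled on the lines quoted in the question (an assumption,
# not the real contents of testData.txt).
text = "Zip: \n----\n98839-0111\n34332\n"

# Anchored pattern: a whole line must be a 5-digit zip with an optional suffix.
pattern = r"^\d{5}(?:-\d{1,4})?$"

# Without re.MULTILINE, ^ only matches at the very start of the text
# and $ only at the very end (or just before a trailing newline).
without_flag = re.findall(pattern, text)

# With re.MULTILINE, ^ and $ also match at the start and end of every line.
with_flag = re.findall(pattern, text, re.MULTILINE)

print(without_flag)
print(with_flag)
```

With this sample, the first call finds nothing and the second finds both zip codes, which matches the behavior described above.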

--EDIT--

My code:

import re
import test_regex


def regex_content(text_content, regex_dictionary):

    #text_content = text_content.lower()
    regex_matches = []

    # Search sanitized text (markup removed) for DLP theme keywords
    for key, value in regex_dictionary.items():

        # Get configuration settings
        min_matches = value.get('min_matches', 1)
        risk = value.get('risk', 1)
        enabled = value.get('enabled', False)
        regex_str = value.get('regex', '')

        # Fast compute True/False hit for each DLP theme word
        if enabled:
            print("Searching for key : %s" % (key))
            my_regex = re.compile(regex_str)
            hits = my_regex.findall(text_content)

            if len(hits) > 0:
                regex_matches.append((key, risk, len(hits), hits))

    # Return array of results (key, risk, number of hits, regex matches)
    return regex_matches


def main():

    #print defaults.test_regex.dlp_regex

    # Read the whole test file into one string
    with open('testData.txt') as data_file:
        text_content = data_file.read()

    for match in regex_content(text_content, test_regex.dlp_regex):
        print("\nFound %s : %s" % (match[0], match[3]))

    print("\n")


if __name__ == '__main__':
    main()

and it is using the regex found here:

'Large number of US Zip Codes' : { 'regex' : "\b\d{5}(?:-\d{1,4})?\b"},

When I prefix my regex string with r (making it a raw string), I can find the zip codes I'm looking for, but I also get every other 5-digit number in the document I am searching through. My understanding is that this happens because the \b characters are ignored. Without the r prefix, though, it cannot find any zip codes at all. It works perfectly fine in Regexr, but not in my code. I haven't had any luck making \b work, nor ^ and $ for identifying the beginnings and ends of the strings I'm searching for. What am I misunderstanding about these special characters?
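A short sketch of the difference the r prefix makes, assuming Python's re module. In a normal string literal Python turns "\b" into a backspace character (\x08) before the regex engine ever sees it, whereas in a raw string it reaches the engine as the word-boundary escape. The sample texts are assumptions, and the plain pattern doubles the backslash before d only to avoid invalid-escape warnings on newer Python versions; it is the same string the original literal produces.

```python
import re

# Sample text modeled on the lines quoted in the question (an assumption).
text = "Zip: \n----\n98839-0111\n34332\n"

# Non-raw string: each "\b" becomes a backspace character (\x08).
plain_pattern = "\b\\d{5}(?:-\\d{1,4})?\b"

# Raw string: "\b" is passed through to the regex engine as a word boundary.
raw_pattern = r"\b\d{5}(?:-\d{1,4})?\b"

first_char = plain_pattern[0]
plain_hits = re.findall(plain_pattern, text)
raw_hits = re.findall(raw_pattern, text)

# \b only checks for a word boundary, so any standalone 5-digit number
# matches too; it cannot tell a zip code from another number.
other_hits = re.findall(raw_pattern, "Order 12345 shipped")

print(repr(first_char))
print(plain_hits)
print(raw_hits)
print(other_hits)
```

So the raw-string version is most likely not ignoring \b; it is honoring it, and the extra 5-digit hits are numbers that also sit between word boundaries. Excluding them would need extra context in the pattern, such as line anchors.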

--Original post--

I am writing a regex for identifying zip codes (and only zip codes), so to avoid false positives I am trying to include a boundary on my regex, using both of the following:

\b\d{5}\b|\b\d{5}-\b\d{1,4}\b

According to the online regex debugger Regexr, my regex should correctly catch 5-digit zip codes, such as 34332. However, I have two problems:
1. This regex does not find any zip codes in my actual code, but it does work when I leave out the boundary (\b) characters. The exact text I'm trying to extract with my regex is:

Zip: 
----
98839-0111
34332

2. I don't see why my regex can't correctly identify 98839-0111 in Regexr. I tried doing the super-primitive approach of

\b\d{5}\b|98839-0111

and even that couldn't identify 98839-0111. Does anyone know what could be going on?

Note: I have also tried using ^ and $ for the boundaries of my regex, but that doesn't find the zip codes either, not even in Regexr.

EDIT: After removing the first part of my regex, leaving only

98839-0111

It can now correctly identify it. I guess this means that once a string has been matched by one part of my regex, it can no longer be found by any subsequent part? Why is this?
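A minimal sketch of what is probably going on, assuming Python's re module (backtracking engines such as the ones Regexr uses treat alternation the same way). Alternatives are tried left to right at each position, and the first one that succeeds wins. Since - is not a word character, \b\d{5}\b already matches 98839, the search resumes after it, and the leftover -0111 matches neither alternative. Putting the longer alternative first, or using one pattern with an optional suffix, avoids this. The variable names are only illustrative.

```python
import re

text = "98839-0111\n34332"

# Original order: the 5-digit alternative is tried first and wins,
# consuming "98839" before the longer alternative gets a chance.
short_first = re.findall(r"\b\d{5}\b|\b\d{5}-\b\d{1,4}\b", text)

# Longer alternative first: the full ZIP+4 form is matched when present.
long_first = re.findall(r"\b\d{5}-\d{1,4}\b|\b\d{5}\b", text)

# Equivalent single pattern with an optional, non-capturing suffix.
optional_suffix = re.findall(r"\b\d{5}(?:-\d{1,4})?\b", text)

print(short_first)
print(long_first)
print(optional_suffix)
```

With this input the first pattern yields 98839 and 34332, while the other two yield 98839-0111 and 34332.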

Upvotes: 4
