What Regex to use in this example

Question

I am parsing a string that I know will definitely only contain the following distinct phrases that I want to parse:

'Man of the Match'
'Goal'
'Assist'
'Yellow Card'
'Red Card'

The string that I am parsing could contain anything from none of the elements above to all of them (i.e. the string being parsed could be anything from None to 'Man of the Match Goal Assist Yellow Card Red Card').

For those of you that understand football, you will also realise that the elements 'Goal' and 'Assist' could in theory be repeated an infinite number of times. The element 'Yellow Card' could be repeated 0, 1 or 2 times also.

I have built the following regexes (where 'incident1' is the string being parsed), which I believed would match any number of repetitions of each phrase. However, all I am getting is single instances:

regex1 = re.compile("Man of the Match*", re.S)
regex2 = re.compile("Goal*", re.S)
regex3 = re.compile("Assist*", re.S)
regex4 = re.compile("Red Card*", re.S)
regex5 = re.compile("Yellow Card*", re.S)

mysearch1 = re.search(regex1, incident1)
mysearch2 = re.search(regex2, incident1)
mysearch3 = re.search(regex3, incident1)
mysearch4 = re.search(regex4, incident1)
mysearch5 = re.search(regex5, incident1)

#print mystring
print "incident1 = ", incident1
if mysearch1 is not None:
    print "Man of the match = ", mysearch1.group()
if mysearch2 is not None:    
    print "Goal = ", mysearch2.group()
if mysearch3 is not None:
    print "Assist = ", mysearch3.group()
if mysearch4 is not None:    
    print "Red Card = ", mysearch4.group()
if mysearch5 is not None:
    print "Yellow Card = ", mysearch5.group()

This works as long as each element appears only once in the string. However, if a player were to score more than one goal, for example, this code returns only one instance of 'Goal'.

Can anyone see what I am doing wrong?

Adam Smith · Accepted Answer

A general overview of the regex functions in Python:

re.search | re.match

In your attempt, you used re.search. It returned only one result, which is expected: re.search stops at the first match. re.search and re.match are used to check whether a line contains a certain regex. You'd use these for something like:

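import re
import subprocess

# calls ipconfig and stores its output in s
# (universal_newlines=True makes check_output return str rather than bytes on Python 3)
s = subprocess.check_output('ipconfig', universal_newlines=True)
for line in s.splitlines():
    if re.search(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", line):
        # if line contains an IP address...
        print(line)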

You use re.match to specifically check if the regex matches at the BEGINNING of the string. This is usually used with a regex that matches the WHOLE string. For example:

lines = ['Adam Smith, Age: 24, Male, Favorite Thing: Reading page: 16',
         'Adam Smith, Age: 16, Male, Favorite Thing: Being a regex example']
# two Adams, but we only want the one who is 16 years old.
repattern = re.compile(r'''Adam \w+, Age: 16, (?:Male|Female), Favorite Thing: [^,]*?''')
for line in lines:
    if repattern.match(line):
        print(line)
        # Adam Smith, Age: 16, Male, Favorite Thing: Being a regex example
# note: a looser, case-insensitive re.search for 'age: 16' would have matched
#   both lines, because the first one contains 'page: 16'!

The takeaway is that you use these two functions to select lines in a longer document (or any iterable).
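
Applied to the question, the difference looks roughly like this. This is a minimal sketch: the value of incident1 is a made-up example, not taken from the question. It also shows why a trailing * did not do what was hoped: the quantifier applies only to the character right before it.

```python
import re

incident1 = "Goal Goal Assist Yellow Card"  # hypothetical input

# "Goal*" means "Goa" followed by zero or more "l" characters,
# not "the word Goal, any number of times".
stretched = re.search(r"Goal*", "Goalllll").group()
shortened = re.search(r"Goal*", "Goa").group()
print(stretched, shortened)            # Goalllll Goa

# re.search stops at the first match, however many follow it.
first_goal = re.search(r"Goal", incident1)
print(first_goal.span())               # (0, 4)

# re.match only tries position 0, so a phrase later in the string is missed.
missed = re.match(r"Assist", incident1)
print(missed)                          # None
assist = re.search(r"Assist", incident1)
print(assist.span())                   # (10, 16)
```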

re.findall | re.finditer

It seems in this case, you aren't trying to match a line, you're trying to pull some specifically-formatted information from the string. Let's see some examples of that.

s = """Phone book:
Adam: (555)123-4567
Joe:  (555)987-6543
Alice:(555)135-7924"""

pat = r'''(?:\(\d{3}\))?\d{3}-?\d{4}'''
phone_numbers = re.findall(pat, s)
print(phone_numbers)
# ['(555)123-4567', '(555)987-6543', '(555)135-7924']
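
The same idea should carry over to the strings in the question. Here is a rough sketch, assuming the five phrases listed there are the only ones that occur; incident_pat and count_incidents are names made up for this example, and the input string is invented.

```python
import re
from collections import Counter

phrases = ["Man of the Match", "Goal", "Assist", "Yellow Card", "Red Card"]
# One alternation built from the escaped phrases.
incident_pat = re.compile("|".join(map(re.escape, phrases)))

def count_incidents(incident):
    """Count every phrase in the string; None is treated as an empty string."""
    return Counter(incident_pat.findall(incident or ""))

incident1 = "Man of the Match Goal Goal Assist Yellow Card Yellow Card"
found = incident_pat.findall(incident1)
counts = count_incidents(incident1)
print(found)
# ['Man of the Match', 'Goal', 'Goal', 'Assist', 'Yellow Card', 'Yellow Card']
print(counts["Goal"], counts["Yellow Card"], counts["Red Card"])
# 2 2 0
```

A Counter returns 0 for phrases that never appear, so there is no need to check for None before reading a count.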

re.finditer returns an iterator of match objects instead of a list of strings. You'd use it the same way you'd use xrange instead of range in Python 2: re.findall(some_pattern, some_string) can build a GIANT list if there are a TON of matches, whereas re.finditer yields them one at a time.
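
A small sketch of the difference, using an invented string: because re.finditer yields match objects, you also get positions, not just the matched text.

```python
import re

incident1 = "Goal Assist Goal"  # hypothetical input

# Each item is a match object, produced lazily as the loop advances.
spans = [(m.group(), m.start()) for m in re.finditer(r"Goal|Assist", incident1)]
print(spans)
# [('Goal', 0), ('Assist', 5), ('Goal', 12)]
```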

other methods: re.split | re.sub

re.split is great if you have a number of things you need to split by. Imagine you had the string:

s = '''Hello, world! It's great that you're talking to me, and everything, but I'd really rather you just split me on punctuation marks. Okay?'''

There's no great way to do that with str.split like you're used to, so instead do:

separators = [".", "!", "?", ","]
splitpattern = '|'.join(map(re.escape, separators))
# re.escape takes a string and escapes out any characters that regex considers
#   special, for instance that . would otherwise be "any character"!
split_s = re.split(splitpattern, s)
print(split_s)
# ['Hello', ' world', " It's great that you're talking to me", ' and everything', " but I'd really rather you just split me on punctuation marks", ' Okay', '']
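
When every separator is a single character, a character class is a shorter way to write the same pattern. The sketch below uses a shortened string and also strips whitespace and drops the empty trailing piece, which is an extra step rather than something re.split does for you.

```python
import re

s = "Hello, world! Okay?"

# [.!?,] matches any one of the four punctuation marks; inside a
# character class the dot is literal, so no escaping is needed.
parts = [p.strip() for p in re.split(r"[.!?,]", s) if p.strip()]
print(parts)
# ['Hello', 'world', 'Okay']
```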

re.sub is great in cases where you know the data follows some regular format, but you're not sure exactly which one, and you REALLY want everything to end up formatted the same way! This will be a little advanced and use several methods, but stick with me....

dates = ['08/08/2014', '09-13-2014', '10.10.1997', '9_29_09']
separators = list()
valid_dates = list()
new_sep = "/"
match_pat = re.compile(r'''
    \d{1,2}              # one or two digits
    (.)                  # followed by a separator (capture)
    \d{1,2}              # one or two more digits
    \1                   # a backreference to that separator
    \d{2}(?:\d{2})?      # a two- or four-digit year''', re.X)
for date in dates:
    match = match_pat.match(date)
    if match:
        sep = match.group(1) # the separator
        separators.append(sep)
        valid_dates.append(date)
    # otherwise this isn't really a date, so skip it (popping from `dates`
    #   while looping over it would silently skip the next element)
repl_pat = '|'.join(map(re.escape, separators))
final_dates = re.sub(repl_pat, new_sep, '\n'.join(valid_dates))
print(final_dates)
# 08/08/2014
# 09/13/2014
# 10/10/1997
# 9/29/09
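
If all you need is the normalized dates, the separator bookkeeping can probably be skipped by using backreferences in the replacement string. This is a sketch, not a drop-in equivalent: it assumes the separator is a single non-digit character (\D rather than .), and date_pat is a name made up for the example.

```python
import re

dates = ['08/08/2014', '09-13-2014', '10.10.1997', '9_29_09']

# Groups: 1 = first number, 2 = separator, 3 = second number, 4 = year.
# \2 in the pattern forces both separators to be the same character.
date_pat = re.compile(r'^(\d{1,2})(\D)(\d{1,2})\2(\d{2}(?:\d{2})?)$')

# \1, \3 and \4 in the replacement re-insert the captured numbers.
normalized = [date_pat.sub(r'\1/\3/\4', d) for d in dates]
print(normalized)
# ['08/08/2014', '09/13/2014', '10/10/1997', '9/29/09']
```

Strings that do not match the pattern are returned unchanged by sub, so they would need to be filtered separately.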

For a slightly less advanced example, you can pass re.sub a function instead of a replacement string! The function is called with each match object, and whatever it returns becomes the replacement. For instance:

def get_department(dept_num):
    departments = {'1': 'I.T.',
                   '2': 'Administration',
                   '3': 'Human Resources',
                   '4': 'Maintenance'}
    if hasattr(dept_num, 'group'): # then it's a match, not a number
        dept_num = dept_num.group(0)
    return departments.get(dept_num, "Unknown Dept")

file = r"""Name,Performance Review,Department
Adam,3,1
Joe,5,2
Alice,1,3
Eve,12,4""" # this looks like a csv file

dept_names = re.sub(r'''\d+$''', get_department, file, flags=re.M)
print(dept_names)
# Name,Performance Review,Department
# Adam,3,I.T.
# Joe,5,Administration
# Alice,1,Human Resources
# Eve,12,Maintenance

Without using regex here you could do:

replaced_lines = []
departments = {'1': 'I.T.',
               '2': 'Administration',
               '3': 'Human Resources',
               '4': 'Maintenance'}
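for line in file.splitlines():
    the_split_line = line.split(',')
    # wrap the looked-up name in a list so it can be added to the slice
    # (unlike the regex version, this also rewrites the header row's last field)
    replaced_lines.append(','.join(the_split_line[:-1] +
                                   [departments.get(the_split_line[-1], "Unknown Dept")]))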
new_file = '\n'.join(replaced_lines)
# LOTS OF STRING MANIPULATION, YUCK!

Instead we replace all that for loop and string splitting, list slicing, and string manipulation with a function and a re.sub call. In fact, if you use a lambda it's even easier!

departments = {'1': 'I.T.',
               '2': 'Administration',
               '3': 'Human Resources',
               '4': 'Maintenance'}
# x is a match object, so look up the matched text with x.group(0)
re.sub(r'''\d+$''', lambda x: departments.get(x.group(0), "Unknown Dept"), file, flags=re.M)
# DONE!
