Reputation: 3782
Basic Question:
How can you name a Python regex group after the value captured by another group, and nest it within a larger regex group?
Origin of Question:
Given a string such as 'Your favorite song is 1 hour 23 seconds long. My phone only records for 1 h 30 mins and 10 secs.'
What is an elegant solution for extracting the times and converting them to a given unit?
Attempted Solution:
My best guess at a solution would be to create a dictionary and then perform operations on the dictionary to convert to the desired unit.
i.e. convert the given string to this:
string[0]:
{'time1': {'day':0, 'hour':1, 'minutes':0, 'seconds':23, 'milliseconds':0}, 'time2': {'day':0, 'hour':1, 'minutes':30, 'seconds':10, 'milliseconds':0}}
string[1]:
{'time1': {'day':4, 'hour':2, 'minutes':3, 'seconds':6, 'milliseconds':30}}
I have a regex solution, but it isn't doing what I would like:
import re
test_string = ['Your favorite song is 1 hour 23 seconds long. My phone only records for 1h 30 mins and 10 secs.',
'This video is 4 days 2h 3min 6sec 30ms']
year_units = ['year', 'years', 'y']
day_units = ['day', 'days', 'd']
hour_units = ['hour', 'hours', 'h']
min_units = ['minute', 'minutes', 'min', 'mins', 'm']
sec_units = ['second', 'seconds', 'sec', 'secs', 's']
millisec_units = ['millisecond', 'milliseconds', 'millisec', 'millisecs', 'ms']
all_units = '|'.join(year_units + day_units + hour_units + min_units + sec_units + millisec_units)
print(all_units)
# pattern = r"""(?P<time> # time group beginning
# (?P<value>[\d]+) # value of time unit
# \s* # may or may not be space between digit and unit
# (?P<unit>%s) # unit measurement of time
# \s* # may or may not be space between digit and unit
# )
# \w+""" % all_units
pattern = r""".*(?P<time> # time group beginning
(?P<value>[\d]+) # value of time unit
\s* # may or may not be space between digit and unit
(?P<unit>%s) # unit measurement of time
\s* # may or may not be space between digit and unit
).* # may be words in between the times
""" % (all_units)
regex = re.compile(pattern)
for val in test_string:
    match = regex.search(val)
    print(match)
    print(match.groupdict())
This fails miserably: it does not deal properly with the nested groupings, and I cannot find a way to name a group after the value captured by another group.
Upvotes: 2
Views: 144
Reputation: 43266
First of all, you can't write a multiline regex with comments and expect it to match anything unless you pass the re.VERBOSE flag:
regex = re.compile(pattern, re.VERBOSE)
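To see the difference the flag makes, here is a minimal sketch. The pattern is a cut-down version of the one in the question (two unit spellings only), and the sample text is made up for illustration:

```python
import re

# Same shape as the pattern in the question, reduced to two spellings for brevity.
pattern = r"""(?P<value>\d+)   # value of the time unit
\s*                            # optional space between digit and unit
(?P<unit>hour|h)               # unit of time
"""

text = "records for 1 h 30 mins"

# Without re.VERBOSE the spaces, newlines and "# ..." text are literal
# characters that the input would have to contain, so nothing matches.
plain = re.compile(pattern).search(text)

# With re.VERBOSE the layout whitespace and the comments are ignored.
verbose = re.compile(pattern, re.VERBOSE).search(text)

print(plain)                # None
print(verbose.groupdict())  # {'value': '1', 'unit': 'h'}
```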
Like you said, the best solution is probably to use a dict:
for val in test_string:
    while True:  # find all times
        match = regex.search(val)  # find the first unit
        if not match:
            break
        matches = {}  # keep track of all units and their values
        while True:
            matches[match.group('unit')] = int(match.group('value'))  # add the match to the dict
            val = val[match.end():]  # remove part of the string so subsequent matches must start at index 0
            m = regex.search(val)
            if not m or m.start() != 0:  # if there are no more matches or there's text between this match and the next, abort
                break
            match = m
        print(matches)  # the finished dict
        # output will be like {'h': 1, 'secs': 10, 'mins': 30}
However, the code above won't work just yet. We need to make two adjustments:
The pattern cannot allow just any text between matches. To allow only whitespace and the word "and" between two matches, you can use:
pattern = r"""(?P<time>   # time group beginning
    (?P<value>[\d]+)      # value of time unit
    \s*                   # may or may not be space between digit and unit
    (?P<unit>%s)          # unit measurement of time
    \s*                   # may or may not be space after the unit
    (?:\band\s+)?         # allow the word "and" between numbers
)                         # time group end
""" % (all_units)
You have to change the order of your units like so:
year_units = ['years', 'year', 'y'] # yearS before year
day_units = ['days', 'day', 'd'] # dayS before day, etc...
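Why? Because regex alternation tries the alternatives from left to right and takes the first one that matches. With a text like 3 years and 1 day, the original order would match 3 year instead of 3 years and, leaving a stray s that breaks the chain of consecutive matches.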
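Reordering by hand is easy to get wrong (with the lists from the question, m would still be tried before ms), so here is a rough sketch of one way to avoid it: sort all spellings longest-first and require a word boundary after the unit. Since group names are fixed when the pattern is compiled, the unit is captured as a value and used as a dict key afterwards. The names MS_PER_UNIT, SPELLINGS, ALIASES, PAIR, RUN, extract_times and to_milliseconds are all made up for this illustration, and years are left out because their length in milliseconds is ambiguous.

```python
import re

# Canonical unit -> length in milliseconds (assumed canonical names).
MS_PER_UNIT = {'day': 86_400_000, 'hour': 3_600_000, 'minute': 60_000,
               'second': 1_000, 'millisecond': 1}

# Accepted spellings, taken from the question.
SPELLINGS = {
    'day': ['day', 'days', 'd'],
    'hour': ['hour', 'hours', 'h'],
    'minute': ['minute', 'minutes', 'min', 'mins', 'm'],
    'second': ['second', 'seconds', 'sec', 'secs', 's'],
    'millisecond': ['millisecond', 'milliseconds', 'millisec', 'millisecs', 'ms'],
}
ALIASES = {alias: unit for unit, aliases in SPELLINGS.items() for alias in aliases}

# Longest spellings first, so 'milliseconds' is tried before 'ms' and 'm'.
alternation = '|'.join(sorted(ALIASES, key=len, reverse=True))

# One "value unit" pair; \b keeps 'd' from matching the start of 'dogs'.
PAIR = re.compile(r'(?P<value>\d+)\s*(?P<unit>%s)\b' % alternation)
# A run of pairs separated only by whitespace and an optional "and".
RUN = re.compile(r'(?:\d+\s*(?:%s)\b\s*(?:and\s+)?)+' % alternation)


def extract_times(text):
    """Return one {canonical unit: value} dict per run of adjacent pairs."""
    times = []
    for run in RUN.finditer(text):
        parts = {}
        for pair in PAIR.finditer(run.group()):
            parts[ALIASES[pair.group('unit')]] = int(pair.group('value'))
        times.append(parts)
    return times


def to_milliseconds(parts):
    """Collapse one extracted dict into a single integer duration."""
    return sum(value * MS_PER_UNIT[unit] for unit, value in parts.items())


test_string = ['Your favorite song is 1 hour 23 seconds long. My phone only records for 1h 30 mins and 10 secs.',
               'This video is 4 days 2h 3min 6sec 30ms']

print(extract_times(test_string[0]))
# [{'hour': 1, 'second': 23}, {'hour': 1, 'minute': 30, 'second': 10}]
print([to_milliseconds(t) for t in extract_times(test_string[1])])
# [352986030]
```

Dividing the millisecond total by the matching MS_PER_UNIT entry gives the duration in any other unit, e.g. 3623000 / MS_PER_UNIT['second'] is 3623.0 seconds for the first time in the first string.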
Upvotes: 1