Arun Jayapal

Reputation: 457

re.sub in Python does not always substitute the string

When I try to substitute one string for another with the re.sub method, the substitution does not always happen.

import re

sentence = ('<date>2004/12/01</date>T09:38:27+01:00'
            'Wed, <date>2012/9/05</date> 10:55:17 UTC %3C%3C%3C')

time_identifier = u'(?<=[\s\.,T])([\d]{2}[:]{1}[\d]{2}([:]{1}[\d]{2})*[\s\.,+]*(UTC|GMT|CEST|EDT|IST|BST)*(\d\d:\d\d)*)(?=[\s\.,T]|\Z)|'\
                  u'(?<=\A)([\d]{2}[:]{1}[\d]{2}([:]{1}[\d]{2})*[\s\.,+]*(UTC|GMT|CEST|EDT|IST|BST)*(\d\d:\d\d)*)(?=[\s\.,T]|\Z)'
time = re.search(time_identifier, sentence, flags=re.U|re.I)
if time:
    try:
        # Group 1 is set when the time follows whitespace, '.', ',' or 'T'
        sentence = re.sub(time.groups()[0], '<time>%s</time>'%time.groups()[0], sentence, flags=re.U|re.I)
    except:
        # Group 5 is set when the time is at the start of the string
        sentence = re.sub(time.groups()[4], '<time>%s</time>'%time.groups()[4], sentence, flags=re.U|re.I)

For the example above, I expect the output to be

<date>2004/12/01</date>T<time>09:38:27+01:00</time>
Wed, <date>2012/9/05</date> <time>10:55:17 UTC</time> %3C%3C%3C

But the re.sub method does not replace "09:38:27+01:00" in the original sentence with

"<time>09:38:27+01:00</time>"

Can anyone please clarify the reason for this?

Upvotes: 0

Views: 456

Answers (2)

Vicent

Reputation: 5452

You have a couple of problems here. First, your pattern is very complicated. Second, you can't do something like:

re.sub('09:38:27+01', "<time>09:38:27+01</time>", s)

because the plus sign is a regex quantifier, so the pattern does not match the literal text in the string s (I'm assuming that your groups contain the proper times) and that part of the string won't be tagged. That answers your question.
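
To make the effect of the plus sign concrete, here is a minimal sketch. The variable names (`found`, `unescaped`, `escaped`) are only illustrative, and using `re.escape` is one possible way to neutralise the quantifier, not necessarily what the asker's code should end up doing.

```python
import re

s = '<date>2004/12/01</date>T09:38:27+01:00'
found = '09:38:27+01:00'  # text captured earlier, then reused as a pattern

# Unescaped: "7+" means "one or more 7s", so the literal "+" in s never matches
# and the string comes back unchanged.
unescaped = re.sub(found, '<time>%s</time>' % found, s)

# Escaped: re.escape() backslash-escapes the "+", so it matches a literal plus sign.
escaped = re.sub(re.escape(found), '<time>%s</time>' % found, s)

print(unescaped)
print(escaped)
```

If the second step needs no regex features at all, plain `s.replace(found, ...)` avoids the escaping question entirely.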

The following works with your sample data (although maybe I've over-simplified the initial pattern):

import re

# s is the sentence from the question.
# The zone names are wrapped in (?:...) so the alternation applies only to them.
p = r'((?:\d{2}:\d{2}:\d{2}\+\d{2}:\d{2})|(?:\d{2}:\d{2}:\d{2} (?:UTC|GMT|CEST|EDT|IST|BST)))'
result = re.findall(p, s)
print(result)
# ['09:38:27+01:00', '10:55:17 UTC']

# Escape the matched text before reusing it as a pattern (handles the plus sign)
s = re.sub(re.escape(result[0]), "<time>%s</time>" % result[0], s)
s = re.sub(re.escape(result[1]), "<time>%s</time>" % result[1], s)
print(s)
# <date>2004/12/01</date>T<time>09:38:27+01:00</time>Wed, <date>2012/9/05</date> <time>10:55:17 UTC</time> %3C%3C%3C

Hope it helps.

Upvotes: 1

Martijn Pieters

Reputation: 1121594

Your expressions are terribly over-complicated. The following is a simplification that matches the exact same patterns:

time_identifier = u'(?:(?<=[\s\.,T])|\A)(\d\d:\d\d(:\d\d)*[\s\.,+]*(UTC|GMT|CEST|EDT|IST|BST)*(\d\d:\d\d)*)(?=[\s\.,T]|\Z)'

Your first time string is not matched because of the look-ahead assertion (the (?=[\s\.,T]|\Z) part); it limits matches to text that is followed by whitespace, a full stop, a comma, a letter T or the end of the string. Your first time string is followed immediately by Wed in the sentence; there is no whitespace.

The following sentence value does match:

sentence = ('<date>2004/12/01</date>T09:38:27+01:00 '
            'Wed, <date>2012/9/05</date> 10:55:17 UTC %3C%3C%3C')

Note the extra space after the timezone.
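
The difference the space makes can be checked with a short sketch. The names `pattern`, `original`, `spaced` and `tagged` are illustrative only. The final step, which passes the compiled pattern straight to `sub` with a `\1` backreference, is an alternative not taken from the question's code; it is shown on the assumption that every matched time should be tagged, and it avoids reusing matched text as a pattern.

```python
import re

# Simplified pattern from this answer, split across raw strings for readability.
time_identifier = (r'(?:(?<=[\s\.,T])|\A)'
                   r'(\d\d:\d\d(:\d\d)*[\s\.,+]*(UTC|GMT|CEST|EDT|IST|BST)*(\d\d:\d\d)*)'
                   r'(?=[\s\.,T]|\Z)')
pattern = re.compile(time_identifier, flags=re.U | re.I)

# Sentence as posted in the question: "+01:00" runs straight into "Wed".
original = ('<date>2004/12/01</date>T09:38:27+01:00'
            'Wed, <date>2012/9/05</date> 10:55:17 UTC %3C%3C%3C')
# Same sentence with a space after the timezone offset.
spaced = ('<date>2004/12/01</date>T09:38:27+01:00 '
          'Wed, <date>2012/9/05</date> 10:55:17 UTC %3C%3C%3C')

# Without the space the look-ahead rejects the first time, so search()
# lands on the second one; with the space the first time is found.
first_original = pattern.search(original).group(1)
first_spaced = pattern.search(spaced).group(1)
print(first_original)
print(first_spaced)

# Tag every match in one pass; \1 refers to the outer capturing group.
tagged = pattern.sub(r'<time>\1</time>', spaced)
print(tagged)
```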

Upvotes: 3
