re.sub in Python does not always substitute the string

Question

When I try to replace one string with another using the re.sub method, the substitution does not always happen.

import re

sentence = ('2004/12/01T09:38:27+01:00'
            'Wed, 2012/9/05 10:55:17 UTC %3C%3C%3C')

time_identifier = u'(?<=[\s\.,T])([\d]{2}[:]{1}[\d]{2}([:]{1}[\d]{2})*[\s\.,+]*(UTC|GMT|CEST|EDT|IST|BST)*(\d\d:\d\d)*)(?=[\s\.,T]|\Z)|'\
                  u'(?<=\A)([\d]{2}[:]{1}[\d]{2}([:]{1}[\d]{2})*[\s\.,+]*(UTC|GMT|CEST|EDT|IST|BST)*(\d\d:\d\d)*)(?=[\s\.,T]|\Z)'
time = re.search(time_identifier, sentence, flags=re.U|re.I)
if time:
    try:
        sentence = re.sub(time.groups()[0], '%s'%time.groups()[0], sentence, flags=re.U|re.I)
    except:
        sentence = re.sub(time.groups()[4], '%s'%time.groups()[4], sentence, flags=re.U|re.I)

For the above provided example, I expect the output of the sentences to be

2004/12/01T09:38:27+01:00
Wed, 2012/9/05 10:55:17 UTC %3C%3C%3C

But the re.sub method does not replace "09:38:27+01:00" in the original sentence with

"09:38:27+01:00"

Can anyone please clarify the reason for this?

Martijn Pieters · Accepted Answer

Your expressions are terribly over-complicated. The following is a simplification that matches the exact same patterns:

time_identifier = u'(?:(?<=[\s\.,T])|\A)(\d\d:\d\d(:\d\d)*[\s\.,+]*(UTC|GMT|CEST|EDT|IST|BST)*(\d\d:\d\d)*)(?=[\s\.,T]|\Z)'

Your time strings are not being matched because of the look-ahead assertion (the (?=[\s\.,T]|\Z) part); it limits matches to anything that is followed by whitespace, a full stop, a comma, a letter T or the end of the string. Your first string is followed immediately by Wed in the sentence; there is no whitespace.
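To see the look-ahead at work in isolation, here is a minimal sketch. It reuses the simplified pattern above as a raw string; the two shortened inputs (no_separator, with_separator) are made up for illustration and are not from the question.

```python
import re

# Simplified pattern from the answer, written as a raw string.
time_identifier = (
    r'(?:(?<=[\s\.,T])|\A)'
    r'(\d\d:\d\d(:\d\d)*[\s\.,+]*(UTC|GMT|CEST|EDT|IST|BST)*(\d\d:\d\d)*)'
    r'(?=[\s\.,T]|\Z)'
)

# Hypothetical shortened inputs: the same timestamp, with and without a separator.
no_separator = '09:38:27+01:00Wed'
with_separator = '09:38:27+01:00 Wed'

blocked = re.search(time_identifier, no_separator)
allowed = re.search(time_identifier, with_separator)

print(blocked)           # None: the look-ahead rejects the trailing 'W'
print(allowed.group(1))  # 09:38:27+01:00
```

In the first input every way of ending the match is followed by a character outside the look-ahead set, so the search finds nothing.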

The following sentence value does match:

sentence = ('2004/12/01T09:38:27+01:00 '
            'Wed, 2012/9/05 10:55:17 UTC %3C%3C%3C')

Note the extra space after the timezone.
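One more thing to check once the search succeeds, assuming the question's code is otherwise unchanged: it passes the matched text back to re.sub as a pattern, and the + in +01:00 is then read as a quantifier rather than a literal character. The sketch below shows the effect and one possible fix with re.escape; the '[%s]' replacement is a hypothetical marker chosen only to make the substitution visible.

```python
import re

sentence = ('2004/12/01T09:38:27+01:00 '
            'Wed, 2012/9/05 10:55:17 UTC %3C%3C%3C')
matched = '09:38:27+01:00'  # the text captured from this sentence

# As a pattern, '7+' means "one or more 7", so the literal '+' never matches.
unescaped = re.sub(matched, '[%s]' % matched, sentence)

# re.escape makes every character of the matched text literal.
escaped = re.sub(re.escape(matched), '[%s]' % matched, sentence)

print(unescaped == sentence)  # True: nothing was replaced
print(escaped)
```

An alternative is to call re.sub with time_identifier itself and a backreference in the replacement, which avoids feeding matched text back in as a pattern.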
