Ashwin Rao

Reputation: 87

Split and group a string based on a pattern in Python

Problem: I have the following sample strings:

ex1 = "00:03:34 hello!! this is example number 1 00:04:00"
ex2 = "00:07:08 Hi I am example number 2"

I want them grouped like this (output):

ex1 out : ("00:03:34", "hello!! this is example number 1", "00:04:00")
ex2 out : ("00:07:08", "Hi I am example number 2", None)

Tries:

I've tried re.split:

time_pat = r"(\d{2}:\d{2}:\d{2})"
re.split(time_pat, ex1)
re.split(time_pat, ex2)

It gives me the following output:

ex1 out : ['', '00:03:34', ' hello!! this is example number 1 ', '00:04:00', '']
ex2 out : ['', '00:07:08', ' Hi I am example number 2']

I can get rid of the empty strings using filter, and the output then looks like:

ex1 out : ['00:03:34', ' hello!! this is example number 1 ', '00:04:00']
ex2 out : ['00:07:08', ' Hi I am example number 2']

The problem is that the ex2 output has length 2, not length 3 with None as the third element. I know I could append None when the length is 2, but I don't want to do that, and I believe a regular expression can handle it.

I've tried the following regular expressions:

re1 : r"(\d{2}:\d{2}:\d{2})(.*)(\d{2}:\d{2}:\d{2})"

As is quite obvious, this parses ex1 but not ex2.

re2 : r"(\d{2}:\d{2}:\d{2})(.*)(\d{2}:\d{2}:\d{2})?"

This parses both, but the third group is always None, because the greedy ".*" in the regular expression consumes the end timestamp.

I've tried a lookahead assertion, but I might have used it incorrectly, since it gave no result. Can anybody help me get the regular expression right?

Upvotes: 2

Views: 656

Answers (2)

jedwards

Reputation: 30250

You could use lookaheads like you suggest, or you could just use non-greedy capturing, an optional group and specify that you want to match until the end of the line ($):

import re

ex1 = "00:03:34 hello!! this is example number 1 00:04:00"
ex2 = "00:07:08 Hi I am example number 2"

for ex in [ex1, ex2]:
    mat = re.match(r'(\d{2}:\d{2}:\d{2})\s(.*?)\s*(\d{2}:\d{2}:\d{2})?$', ex)
    if mat:
        print(mat.groups())

Output:

('00:03:34', 'hello!! this is example number 1', '00:04:00')
('00:07:08', 'Hi I am example number 2', None)

Note: This is very close to what you had. I just used non-greedy capturing for the middle group (the ? in (.*?)) and added a $ at the end to tell it to match the entire line. Without non-greedy capturing, your optional timestamp at the end would get eaten by the middle group. Without the $ anchor, the regex engine would have no reason to extend the non-greedy middle group or try the optional timestamp, because an empty middle group already gives a successful match.
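To make the note concrete, here is a small sketch that runs three variants of the pattern against ex1 so the effect of each change can be compared side by side. The names TS, greedy, lazy_unanchored and lazy_anchored are illustrative and do not come from the answer above.

```python
import re

ex1 = "00:03:34 hello!! this is example number 1 00:04:00"

# Timestamp sub-pattern, factored out for readability (illustrative name).
TS = r"\d{2}:\d{2}:\d{2}"

# Greedy middle group: .* swallows the trailing timestamp,
# so the optional third group matches nothing.
greedy = re.match(rf"({TS})\s(.*)\s*({TS})?$", ex1).groups()

# Non-greedy middle group but no $ anchor: the engine stops as soon
# as it has any match, leaving the middle group empty.
lazy_unanchored = re.match(rf"({TS})\s(.*?)\s*({TS})?", ex1).groups()

# Non-greedy middle group plus $: the middle group is forced to grow
# until the rest of the pattern reaches the end of the line.
lazy_anchored = re.match(rf"({TS})\s(.*?)\s*({TS})?$", ex1).groups()

print(greedy)           # ('00:03:34', 'hello!! this is example number 1 00:04:00', None)
print(lazy_unanchored)  # ('00:03:34', '', None)
print(lazy_anchored)    # ('00:03:34', 'hello!! this is example number 1', '00:04:00')
```

Only the last variant produces the three-part grouping asked for in the question.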

Upvotes: 3

alpha bravo

Reputation: 7948

Use this pattern to capture instead of splitting:

^(\d{2}:\d{2}:\d{2})(.*?)((?:\d{2}:\d{2}:\d{2})|)$
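As a rough sketch of how this pattern might be used from Python: note that, as written, the middle group keeps its surrounding whitespace, and a missing end timestamp comes back as an empty string rather than None, because the third group matches the empty alternative. The helper name split_line and the normalization step are assumptions added here for illustration, not part of the answer.

```python
import re

# The pattern from this answer, compiled once.
PATTERN = re.compile(r"^(\d{2}:\d{2}:\d{2})(.*?)((?:\d{2}:\d{2}:\d{2})|)$")


def split_line(line):
    """Hypothetical helper: return (start, text, end), or None if no match."""
    m = PATTERN.match(line)
    if m is None:
        return None
    start, text, end = m.groups()
    # Strip the whitespace the middle group captures, and turn the
    # empty-string "no timestamp" result into None.
    return (start, text.strip(), end or None)


ex1 = "00:03:34 hello!! this is example number 1 00:04:00"
ex2 = "00:07:08 Hi I am example number 2"

print(split_line(ex1))  # ('00:03:34', 'hello!! this is example number 1', '00:04:00')
print(split_line(ex2))  # ('00:07:08', 'Hi I am example number 2', None)
```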


Upvotes: 0
