Reputation: 35
I'm trying to match the data below so that I can extract the text between the timecodes.
subs='''
1
00:00:00,130 --> 00:00:01,640
where you there when it
happened?
Who else saw you?
2
00:00:01,640 --> 00:00:03,414
This might be your last chance to
come clean. Take it or leave it.
'''
Regex=re.compile(r'(\d\d:\d\d\:\d\d,\d\d\d) --> (\d\d:\d\d\:\d\d,\d\d\d)(\n.+)((\n)?).+')
My regex matches the timecode line and the first line of text, but it returns only a few characters of the second line instead of the whole line. How can I get it to match everything between one timecode line and the next?
Upvotes: 1
Views: 301
Reputation: 163362
You could also get the matches without using DOTALL.
Match the timecode line, then capture in group 1 all of the following lines that do not start with a timecode, using a negative lookahead.
^\d{2}:\d{2}:\d{2},\d+ --> \d{2}:\d{2}:\d{2},\d+((?:\r?\n(?!\d{2}:\d{2}:\d{2},\d+ --> \d{2}:\d{2}:\d{2},\d).*)*)
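As a minimal sketch of how this pattern might be applied in Python, assuming the data layout from the question (index on its own line) and that `re.MULTILINE` is enabled so `^` matches at every line start. Because the index line of the next entry does not start with a timecode, it ends up in group 1 as well; the hypothetical helper `extract_texts` drops a trailing digits-only line, which is an assumption that would also remove a genuine subtitle line consisting only of digits.

```python
import re

# Sample data in the layout from the question.
subs = '''
1
00:00:00,130 --> 00:00:01,640
where you there when it
happened?
Who else saw you?
2
00:00:01,640 --> 00:00:03,414
This might be your last chance to
come clean. Take it or leave it.
'''

# Same pattern as above; re.MULTILINE makes ^ match at each line start.
pattern = re.compile(
    r'^\d{2}:\d{2}:\d{2},\d+ --> \d{2}:\d{2}:\d{2},\d+'
    r'((?:\r?\n(?!\d{2}:\d{2}:\d{2},\d+ --> \d{2}:\d{2}:\d{2},\d).*)*)',
    re.MULTILINE,
)


def extract_texts(text):
    """Hypothetical helper: return the subtitle text of every entry."""
    texts = []
    for m in pattern.finditer(text):
        body = m.group(1)
        # Assumption: a trailing digits-only line is the next entry's index.
        body = re.sub(r'\r?\n\d+\s*\Z', '', body)
        texts.append(body.strip())
    return texts


for entry in extract_texts(subs):
    print(repr(entry))
```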
In parts
^
Start of string (or of each line with re.MULTILINE, which this data needs)
\d{2}:\d{2}:\d{2},\d+ --> \d{2}:\d{2}:\d{2},\d+
Match the timecode pattern
(
Capture group 1
(?:
Non capturing group
\r?\n
Match a newline
(?!\d{2}:\d{2}:\d{2},\d+ --> \d{2}:\d{2}:\d{2},\d)
Negative lookahead, assert not the timecode
.*
Match any char except a newline 0+ times
)*
Close noncapturing group and repeat 0+ times
)
Close capturing group 1
Upvotes: 1
Reputation: 1348
I'm not sure, but I think the solution below may fit your case better.
※ With the solution below, you can not only extract the text between the timecodes but also associate each piece of text with its timecode.
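import re

multiline_text = """
1 00:00:00,130 --> 00:00:01,640
where you there when it happened?
Who else saw you?
2 00:00:01,640 --> 00:00:03,414
This might be your last chance to
come clean. Take it or leave it.
"""

lines = multiline_text.split('\n')
# Maps each "start --> end" timecode to the text that follows it.
# (Named subtitles rather than dict to avoid shadowing the built-in.)
subtitles = {}
current_key = None
for line in lines:
    # A line containing a timecode pair starts a new entry.
    is_key_match_obj = re.search(r'([\d:,]{12})(\s-->\s)([\d:,]{12})', line)
    if is_key_match_obj:
        current_key = is_key_match_obj.group()
        continue
    if current_key:
        if current_key in subtitles:
            if not line:
                # A blank line is kept as a newline.
                subtitles[current_key] += '\n'
            else:
                # Note: text lines are appended without a separator.
                subtitles[current_key] += line
        else:
            subtitles[current_key] = line
print(subtitles)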
Upvotes: 1
Reputation: 521259
One possible issue with your current approach is that you are not using DOTALL mode when trying to capture all the content between timestamps. Without it, . does not match a newline, so .+ stops at the end of a line. Here is a working re.search call in DOTALL mode:
import re

subs = """
1 00:00:00,130 --> 00:00:01,640
where you there when it happened?
Who else saw you?
2 00:00:01,640 --> 00:00:03,414
This might be your last chance to
come clean. Take it or leave it. """

# DOTALL lets . match newlines, so (.*) can span several lines of text.
match = re.search(r'\d+ \d{2}:\d{2}:\d{2},\d+ --> \d{2}:\d{2}:\d{2},\d+\s*(.*)\d+ \d{2}:\d{2}:\d{2},\d+ --> \d{2}:\d{2}:\d{2},\d+', subs, flags=re.DOTALL)
if match:
    print(match.group(1))
else:
    print('NO MATCH')
This prints:
where you there when it happened?
Who else saw you?
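The search above returns only the text between the first two timestamps. As a rough sketch of one way to collect every block in DOTALL mode, assuming the same layout (index and timecode on one line): a lazy `(.*?)` combined with a lookahead for the next header or the end of the string. The names `HEADER` and `block_pattern` are illustrative, not part of the original code.

```python
import re

subs = """
1 00:00:00,130 --> 00:00:01,640
where you there when it happened?
Who else saw you?
2 00:00:01,640 --> 00:00:03,414
This might be your last chance to
come clean. Take it or leave it. """

# Index followed by the timecode pair, as in the pattern above.
HEADER = r'\d+ \d{2}:\d{2}:\d{2},\d+ --> \d{2}:\d{2}:\d{2},\d+'

# Lazy (.*?) stops at the next header or at the end of the string (\Z);
# the lookahead does not consume the header, so the next match can use it.
block_pattern = re.compile(
    HEADER + r'\s*(.*?)\s*(?=' + HEADER + r'|\Z)',
    re.DOTALL,
)

texts = block_pattern.findall(subs)
for text in texts:
    print(repr(text))
```

A subtitle line that itself looks like a header would end a block early, so this is only suitable for data shaped like the sample.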
Upvotes: 0