sanoop

Reputation: 35

Regex pattern only matching half way through

I'm trying to match the below data in such a way that I can extract the text between the timecodes.

subs='''

1
00:00:00,130 --> 00:00:01,640

where you there when it 
happened?

Who else saw you?

2
00:00:01,640 --> 00:00:03,414


This might be your last chance to


come clean. Take it or leave it.
'''

Regex=re.compile(r'(\d\d:\d\d\:\d\d,\d\d\d) --> (\d\d:\d\d\:\d\d,\d\d\d)(\n.+)((\n)?).+')

My regex matches the first timecode line and the first line of text, but it returns only a few characters from the second line instead of the whole line. How can I get it to match everything between one timecode line and the next?

Upvotes: 1

Views: 301

Answers (3)

The fourth bird

Reputation: 163362

You could also get the matches without using DOTALL.

Match the timecode, then capture in group 1 all the following lines that do not start with a timecode, using a negative lookahead.

^\d{2}:\d{2}:\d{2},\d+ --> \d{2}:\d{2}:\d{2},\d+((?:\r?\n(?!\d{2}:\d{2}:\d{2},\d+ --> \d{2}:\d{2}:\d{2},\d).*)*)

In parts

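  • ^ Start of string (or start of each line when re.MULTILINE is enabled, which this data needs because the timecodes are not at the very start of the string)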
  • \d{2}:\d{2}:\d{2},\d+ --> \d{2}:\d{2}:\d{2},\d+ Match the timecode pattern
  • ( Capture group 1
    • (?: Non capturing group
      • \r?\n Match a newline
      • (?!\d{2}:\d{2}:\d{2},\d+ --> \d{2}:\d{2}:\d{2},\d) Negative lookahead, assert that the next line does not start with a timecode
      • .* Match any char except a newline 0+ times
    • )* Close noncapturing group and repeat 0+ times
  • ) Close capturing group 1

Regex demo
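
As a minimal sketch, the pattern above could be applied in Python like this. The names TIMECODE, pattern and strip_trailing_index are only illustrative and not part of the original answer, and re.MULTILINE is assumed so that ^ matches at the start of each line.

```python
import re

# Sample data in the layout from the question: index line, timecode line, text.
subs = """
1
00:00:00,130 --> 00:00:01,640

where you there when it
happened?

Who else saw you?

2
00:00:01,640 --> 00:00:03,414


This might be your last chance to


come clean. Take it or leave it.
"""

TIMECODE = r"\d{2}:\d{2}:\d{2},\d+ --> \d{2}:\d{2}:\d{2},\d+"

# re.MULTILINE makes ^ match at the start of every line, not only the string.
pattern = re.compile(rf"^{TIMECODE}((?:\r?\n(?!{TIMECODE}).*)*)", re.MULTILINE)


def strip_trailing_index(block: str) -> str:
    # Hypothetical cleanup helper: group 1 also contains the next cue's index
    # line (e.g. "2"), because that line does not start with a timecode.
    return re.sub(r"\s*\n\d+\s*\Z", "", block).strip()


texts = [strip_trailing_index(m.group(1)) for m in pattern.finditer(subs)]
for text in texts:
    print(repr(text))
```

Note that with this layout the captured group runs up to the next timecode line, so it includes the index line of the following cue. The helper above removes it, at the cost of also removing a genuine subtitle line that consists only of digits.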

Upvotes: 1

cxↄ

Reputation: 1348

I'm not sure, but I think the solution below fits your case better.
※ With it you can not only extract the text between the timecodes, but also associate each block of text with its timecode.

import re

multiline_text=\
"""

1 00:00:00,130 --> 00:00:01,640

where you there when it happened?

Who else saw you?

2 00:00:01,640 --> 00:00:03,414

This might be your last chance to

come clean. Take it or leave it.
"""

lines = multiline_text.split('\n')
subtitles = {}  # maps each timecode line to the text that follows it
current_key = None

for line in lines:
  # A timecode looks like "00:00:00,130 --> 00:00:01,640"
  is_key_match_obj = re.search(r'([\d:,]{12})(\s-->\s)([\d:,]{12})', line)
  if is_key_match_obj:
    current_key = is_key_match_obj.group()
    continue

  if current_key:
    if current_key in subtitles:
      if not line:
        # Keep blank lines as line breaks inside the text
        subtitles[current_key] += '\n'
      else:
        subtitles[current_key] += line
    else:
      subtitles[current_key] = line

print(subtitles)

Upvotes: 1

Tim Biegeleisen

Reputation: 521259

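One possible issue with your current approach is that you are not using DOTALL mode when trying to capture all the content between timestamps. Without it, the dot does not match newlines, so a single .+ or .* cannot span several lines. Here is a working re.search call in DOTALL mode: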

import re

subs="""

1 00:00:00,130 --> 00:00:01,640

where you there when it happened?

Who else saw you?

2 00:00:01,640 --> 00:00:03,414

This might be your last chance to

come clean. Take it or leave it. """
# (.*) is greedy, so group 1 runs up to the last timestamp line in the string
match = re.search(r'\d+ \d{2}:\d{2}:\d{2},\d+ --> \d{2}:\d{2}:\d{2},\d+\s*(.*)\d+ \d{2}:\d{2}:\d{2},\d+ --> \d{2}:\d{2}:\d{2},\d+', subs, flags=re.DOTALL)
if match:
    print(match.group(1))
else:
    print('NO MATCH')

This prints:

where you there when it happened?

Who else saw you?
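
The re.search call above returns only the first block. One possible way to collect every block in DOTALL mode is a lazy capture followed by a lookahead for the next header or the end of the string. This is a sketch under that assumption, not part of the original answer, and HEADER, pattern and blocks are illustrative names.

```python
import re

subs = """
1 00:00:00,130 --> 00:00:01,640

where you there when it happened?

Who else saw you?

2 00:00:01,640 --> 00:00:03,414

This might be your last chance to

come clean. Take it or leave it. """

# Index number and timecode on one line, as in the sample above.
HEADER = r"\d+ \d{2}:\d{2}:\d{2},\d+ --> \d{2}:\d{2}:\d{2},\d+"

# (.*?) is lazy: it stops at the first point where the lookahead succeeds,
# i.e. just before the next header or before trailing whitespace at the end.
pattern = re.compile(rf"{HEADER}\s*(.*?)(?=\s*{HEADER}|\s*\Z)", re.DOTALL)

# With a single capturing group, findall returns a list of the group texts.
blocks = pattern.findall(subs)
for block in blocks:
    print(repr(block))
```

The lookahead does not consume the next header, so the following match can start from it.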

Upvotes: 0
