Reputation: 469

RegEx matching for SRT and VTT syntax from subtitles

I have subtitles in both SRT and VTT format, and I need to match and remove the format-specific syntax so that only the clean text lines remain.

I have come up with this regex: /\n?\d*?\n?^.* --> [012345]{2}:.*$/m

Sample content (a mix of both SRT and VTT):

1
00:00:04,019 --> 00:00:07,299
line1
line2

2
00:00:07,414 --> 00:00:09,155
line1

00:00:09,276 --> 00:00:11,429
line1

00:00:11,549 --> 00:00:14,874
line1
line2

In the regex101 simulation this matches both the subtitle number and the timing, as expected: https://regex101.com/r/zRsRMR/2/

But when I use it in the actual code (even with the code snippet generated by https://regex101.com), it only matches the timing, not the subtitle number.

See output:

array (5)
0 => array (1)
0 => "00:00:04,019 --> 00:00:07,299
" (30)
1 => array (1)
0 => "
00:00:07,414 --> 00:00:09,155
" (31)
2 => array (1)
0 => "
00:00:09,276 --> 00:00:11,429
" (31)
3 => array (1)
0 => "
00:00:11,549 --> 00:00:14,874
" (31)
4 => array (1)
0 => "
00:00:11,549 --> 00:00:14,874
" (31)

Can be tested on: http://sandbox.onlinephpfunctions.com/code/dec294251b879144f40a6d1bdd516d2050321242

The goal is to match the subtitle number as well. For example, the first expected match should be:

1
00:00:04,019 --> 00:00:07,299

Upvotes: 2

Answers (3)

Mantas D

Reputation: 4150

The VTT format can contain styles. Also, people edit these files by hand and often make formatting mistakes (wrong timestamp format, extra new lines, stray spaces...). This makes writing a reliable regexp almost impossible.

If you want to parse subtitles correctly, one of the best options is to use a library:

$srt = '
   1
   00:00:04,019 --> 00:00:07,299
   line1
   line2

';
echo Subtitles::loadFromString($srt)->content('txt'); 
// Output: 
// line1
// line2

You can parse both .srt and .vtt files this way.

https://github.com/mantas-done/subtitles

Upvotes: 1

The fourth bird

Reputation: 163362

You could turn this part of your expression, \n?\d*?\n?, into an optional group that matches 1+ digits followed by a newline. The character class [012345] can also be written as [0-5].

You could update your expression to:

^(?:\d+\n)?.*\h+-->\h+[0-5]{2}:.*$

^ Start of the line (with the m flag, as in your pattern)
(?:\d+\n)? Optional 1+ digits and a newline
.*\h+-->\h+ Match 0+ times any char except newline, 1+ horizontal whitespace chars, --> and 1+ horizontal whitespace chars
[0-5]{2}: Match 2 times a digit 0-5, followed by :
.* Match 0+ times any char except newline
$ End of the line (with the m flag)

Regex demo | Php demo
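For a quick check outside PHP, the pattern above can be sketched in JavaScript. This is only a rough port, not the linked demo: JavaScript has no \h, so [ \t] is swapped in for it, and the names cueHeader, sample, headers and textOnly are made up for this illustration.

```javascript
// Port of the pattern above; [ \t] replaces PCRE's \h (an assumption for JS).
const cueHeader = /^(?:\d+\n)?.*[ \t]+-->[ \t]+[0-5]{2}:.*$/gm;

// Sample cues taken from the question (LF line endings assumed).
const sample = [
  "1",
  "00:00:04,019 --> 00:00:07,299",
  "line1",
  "line2",
  "",
  "2",
  "00:00:07,414 --> 00:00:09,155",
  "line1",
  "",
  "00:00:09,276 --> 00:00:11,429",
  "line1",
].join("\n");

// Every cue header, including the optional subtitle number.
const headers = sample.match(cueHeader);
console.log(headers);

// Remove the headers, then drop the empty lines that are left behind.
const textOnly = sample
  .replace(cueHeader, "")
  .split("\n")
  .filter((line) => line.trim() !== "");
console.log(textOnly);
```

The optional group is what lets a single pattern cover cues with and without a number. If the input uses Windows line endings, the \n inside that group would presumably need adjusting, or the text would need normalising first.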

Upvotes: 2

Emma

Reputation: 27723

I'm not quite sure whether this is what you would like to capture. However, you may want to wrap the parts you need in capturing groups so that they are simple to retrieve. For instance, this expression shows how capturing groups can be placed around your desired chars:

^([0-9]+\n|)([0-9:,->\s]+)

It may not be the right way to do it, or the best expression. However, it might give you an idea of how to approach the problem differently.

I'm guessing that you might want to capture the datetime line and lines before that, which may or may not have a number.

You might want to write a script that cleans your data before sending it to the RegEx engine, so that a simple expression is enough.
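A cleaning step of that kind could look roughly like the sketch below. It assumes the raw text may contain Windows line endings, a byte-order mark or trailing spaces, none of which the question confirms. normalizeSubtitleText is a hypothetical helper name, not part of any library.

```javascript
// Hypothetical helper: normalise raw subtitle text before matching.
function normalizeSubtitleText(raw) {
  return raw
    .replace(/^\uFEFF/, "")    // drop a leading byte-order mark, if present
    .replace(/\r\n?/g, "\n")   // CRLF or a lone CR becomes LF
    .replace(/[ \t]+$/gm, ""); // strip trailing spaces and tabs on each line
}

const headerRe = /^([0-9]+\n|)([0-9:,->\s]+)/m;

// Assumed messy input: trailing spaces after the number and CRLF line endings.
const raw = "1  \r\n00:00:04,019 --> 00:00:07,299\r\nline1\r\n";
const cleaned = normalizeSubtitleText(raw);

const before = raw.match(headerRe);
const after = cleaned.match(headerRe);

// Group 1 (the subtitle number) is empty on the raw text, filled after cleaning.
console.log(JSON.stringify(before[1]), JSON.stringify(after[1]));
```

With the raw text, the number is swallowed by the second group because \s also matches the spaces and the carriage return. After normalising, the first group can pick it up on its own.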

Example Test with JavaScript

const regex = /^([0-9]+\n|)([0-9:,->\s]+)/mg;
const str = `1
00:00:04,019 --> 00:00:07,299
line1
line2

2
00:00:07,414 --> 00:00:09,155
line1

00:00:09,276 --> 00:00:11,429
line1

00:00:11,549 --> 00:00:14,874
line1
line2
`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}

PHP Test

This may not generate your desired output; it is just an example:

$re = '/^([0-9]+\n|)([0-9:,->\s]+)/m';
$str = '1
00:00:04,019 --> 00:00:07,299
line1
line2

2
00:00:07,414 --> 00:00:09,155
line1

00:00:09,276 --> 00:00:11,429
line1

00:00:11,549 --> 00:00:14,874
line1
line2
';

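// PREG_PATTERN_ORDER puts every full match into $matches[0], which the loop below relies on
preg_match_all($re, $str, $matches, PREG_PATTERN_ORDER);

foreach ($matches[0] as $key => $value) {
    if ($value === "") {
        // Discard empty matches
        unset($matches[0][$key]);
    } else {
        // Strip the surrounding newlines picked up by \s
        $matches[0][$key] = trim($value);
    }
}

var_dump($matches[0]);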

Performance Test

This JavaScript snippet shows the performance of a similar expression, using a simple for loop that runs 1 million times.

const repeat = 1000000;
// Compiled once, so the loop measures only the replacement itself
const regex = /(.*)([0-9:,->\s]+)/gm;
const string = '2  \n00:00:07,414 --> 00:00:09,155';
let match;

const start = Date.now();

// Run the replacement exactly `repeat` times
for (let i = repeat; i > 0; i--) {
	match = string.replace(regex, "$2");
}

const elapsed = Date.now() - start;
console.log("YAAAY! \"" + match + "\" is a match 💚💚💚 ");
console.log(elapsed / 1000 + " is the runtime of " + repeat + " times benchmark test. 😳 ");

If you wish to capture all your desired output in one variable, you can simply add a capturing group around the entire expression and then call it using $1.
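As a small sketch of that idea, the first expression can be wrapped in one extra pair of parentheses, which shifts the existing groups to $2 and $3. The names wrapped, input and tagged are only for illustration.

```javascript
// The first expression with one extra capturing group around everything.
const wrapped = /^(([0-9]+\n|)([0-9:,->\s]+))/m;

const input = "2\n00:00:07,414 --> 00:00:09,155\nline1";

// m[1] is the whole header, m[2] the optional number, m[3] the timing part.
const m = input.match(wrapped);
console.log(JSON.stringify(m[1]));

// In a replacement string the same group is referenced as $1.
const tagged = input.replace(wrapped, "<<$1>>");
console.log(JSON.stringify(tagged));
```

Note that \s in the last group also picks up the newline after the timing, so you may want to trim() the captured value.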

You can also add or relax boundaries if you like, as in this one:

^(?:[0-9]+\n|\n)(([0-9:,]+)([\s->]+)([0-9:,]+))$

Example Test with JavaScript for second expression

const regex = /^(?:[0-9]+\n|\n)(([0-9:,]+)([\s->]+)([0-9:,]+))$/gm;
const str = `1
00:00:04,019 --> 00:00:07,299
- cdcdc
- cddcd

2
00:00:07,414 --> 00:00:09,155
54564

00:00:09,276 --> 00:00:11,429
- 445454 - ccd
- cdscdcdcd

00:00:11,549 --> 00:00:14,874
line1
line2
`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}

Upvotes: 4