Reputation: 469
I have subtitles in both srt and vtt formats, and I need to match and remove the format-specific syntax so that only clean lines of text remain.
I have come up with this regex:
/\n?\d*?\n?^.* --> [012345]{2}:.*$/m
Sample content (a mix of srt and vtt):
1
00:00:04,019 --> 00:00:07,299
line1
line2
2
00:00:07,414 --> 00:00:09,155
line1
00:00:09,276 --> 00:00:11,429
line1
00:00:11,549 --> 00:00:14,874
line1
line2
This matches both the subtitle number and the timing, as expected, when simulated at https://regex101.com/r/zRsRMR/2/
But when used in actual code (even the code snippet generated by https://regex101.com), it only matches the timing, not the subtitle number.
See output:
array (5)
0 => array (1)
0 => "00:00:04,019 --> 00:00:07,299
" (30)
1 => array (1)
0 => "
00:00:07,414 --> 00:00:09,155
" (31)
2 => array (1)
0 => "
00:00:09,276 --> 00:00:11,429
" (31)
3 => array (1)
0 => "
00:00:11,549 --> 00:00:14,874
" (31)
4 => array (1)
0 => "
00:00:11,549 --> 00:00:14,874
" (31)
Can be tested on: http://sandbox.onlinephpfunctions.com/code/dec294251b879144f40a6d1bdd516d2050321242
The goal is to match the subtitle number as well. For example, the first expected match should be:
1
00:00:04,019 --> 00:00:07,299
Upvotes: 2
Views: 2193
Reputation: 4150
The VTT format can include styles. Also, people edit these files by hand and often make formatting mistakes (wrong timestamp format, extra newlines, stray spaces...). This makes writing a reliable regexp almost impossible.
If you want to parse subtitles correctly, one of the best options is to use a library:
$srt = '
1
00:00:04,019 --> 00:00:07,299
line1
line2
';
echo Subtitles::loadFromString($srt)->content('txt');
// Output:
// line1
// line2
You can parse both .srt and .vtt files this way.
https://github.com/mantas-done/subtitles
Upvotes: 1
Reputation: 163362
You could turn this part of your expression, \n?\d*?\n?, into an optional group that matches 1+ digits followed by a newline. The character class [012345] can also be written as [0-5].
You could update your expression to:
^(?:\d+\n)?.*\h+-->\h+[0-5]{2}:.*$
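As a rough illustration, the updated pattern can be tried in JavaScript. This is a sketch rather than a drop-in fix: JavaScript regexes have no \h, so [ \t] stands in for it, and the \r?\n is an added assumption so that Windows line endings are tolerated. The names cueHeader and sample are made up for this example.

```javascript
// Sketch of the updated pattern, adapted for JavaScript.
// Assumptions: [ \t] replaces \h (not supported in JS), and \r?\n
// is added so that files with CRLF line endings also match.
const cueHeader = /^(?:\d+\r?\n)?.*[ \t]+-->[ \t]+[0-5]{2}:.*$/gm;

const sample = [
  "1",
  "00:00:04,019 --> 00:00:07,299",
  "line1",
  "line2",
  "2",
  "00:00:07,414 --> 00:00:09,155",
  "line1",
  "00:00:09,276 --> 00:00:11,429",
  "line1",
].join("\n");

// Each match is the optional cue number plus the timing line.
const headers = sample.match(cueHeader);
console.log(headers);
```

A line of subtitle text that consists only of digits and sits directly above a timing line would be picked up as a cue number, so the result is worth checking on real files.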
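In parts:
^              Start of string (with the m flag, the start of each line)
(?:\d+\n)?     Optional 1+ digits and a newline
.*\h+-->\h+    Match 0+ times any char except a newline, then 1+ horizontal whitespace chars, --> and 1+ horizontal whitespace chars
[0-5]{2}:      Match 2 times a digit 0-5, followed by a colon
.*             Match 0+ times any char except a newline
$              End of string (with the m flag, the end of each line)
Upvotes: 2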
Reputation: 27723
I'm not quite sure whether this is what you want to capture. However, wrapping parts of your pattern in capturing groups makes the pieces easy to retrieve. For instance, this expression shows how capturing groups can surround your desired chars:
^([0-9]+\n|)([0-9:,->\s]+)
It may not be the best expression, but it might give you an idea for approaching the problem differently.
I'm guessing that you want to capture the datetime line and the line before it, which may or may not contain a number.
This graph shows how the expression would work and you can visualize other expressions in this link:
You might want to write a script to clean your data before sending it to the RegEx engine, so that a simpler expression is enough.
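A minimal sketch of such a clean-up step might look like the following. The helper name normalizeSubtitleText is hypothetical, and the choice of what to normalise (a leading byte-order mark, CRLF line endings, trailing whitespace) is an assumption about what typically goes wrong in hand-edited files, not something taken from the question.

```javascript
// Hypothetical helper: normalise subtitle text before applying a regex.
function normalizeSubtitleText(raw) {
  return raw
    .replace(/^\uFEFF/, "")        // drop a leading byte-order mark, if any
    .replace(/\r\n?/g, "\n")       // CRLF or lone CR becomes LF
    .split("\n")
    .map((line) => line.trimEnd()) // strip trailing spaces and tabs
    .join("\n");
}

const raw = "1 \r\n00:00:04,019 --> 00:00:07,299\r\nline1\r\n";
const cleaned = normalizeSubtitleText(raw);

// After clean-up, the expression from this answer sees the cue number.
const firstMatch = /^([0-9]+\n|)([0-9:,->\s]+)/m.exec(cleaned);
console.log(JSON.stringify(cleaned));
console.log(JSON.stringify(firstMatch[1]));
```

With the line endings and stray spaces out of the way, the pattern itself does not need to account for them.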
const regex = /^([0-9]+\n|)([0-9:,->\s]+)/mg;
const str = `1
00:00:04,019 --> 00:00:07,299
line1
line2
2
00:00:07,414 --> 00:00:09,155
line1
00:00:09,276 --> 00:00:11,429
line1
00:00:11,549 --> 00:00:14,874
line1
line2
`;
let m;
while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
This may not generate your desired output, it is just an example:
$re = '/^([0-9]+\n|)([0-9:,->\s]+)/m';
$str = '1
00:00:04,019 --> 00:00:07,299
line1
line2
2
00:00:07,414 --> 00:00:09,155
line1
00:00:09,276 --> 00:00:11,429
line1
00:00:11,549 --> 00:00:14,874
line1
line2
';
// PREG_PATTERN_ORDER puts every full match in $matches[0];
// with PREG_SET_ORDER, $matches[0] would hold only the first match and its groups.
preg_match_all($re, $str, $matches, PREG_PATTERN_ORDER, 0);
foreach ($matches[0] as $key => $value) {
    if ($value == "") {
        unset($matches[0][$key]);
    } else {
        // Strip the trailing newline picked up by \s
        $matches[0][$key] = trim($value);
    }
}
var_dump($matches[0]);
This JavaScript snippet shows the performance of that expression using a simple for loop that runs 1 million times.
const repeat = 1000000;
const start = Date.now();
let match;
for (let i = repeat; i > 0; i--) {
    const string = '2 \n00:00:07,414 --> 00:00:09,155';
    const regex = /(.*)([0-9:,->\s]+)/gm;
    match = string.replace(regex, "$2");
}
const end = Date.now() - start;
console.log("YAAAY! \"" + match + "\" is a match 💚💚💚 ");
console.log(end / 1000 + " is the runtime of " + repeat + " times benchmark test. 😳 ");
If you wish to capture all your desired output in one variable, you can simply add a capturing group around the entire expression and then refer to it using $1.
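To make the outer-group idea concrete, here is a small sketch. The pattern wholeHeader is a simplified variant written for this example (it assumes a single space on each side of the arrow), not the exact expression from above, and the variable names are made up.

```javascript
// The whole header (optional cue number plus timing line) is wrapped in group 1.
// Assumption: exactly one space on each side of "-->".
const wholeHeader = /^((?:[0-9]+\n)?[0-9:,]+ --> [0-9:,]+)$/gm;

const text =
  "1\n00:00:04,019 --> 00:00:07,299\nline1\n" +
  "00:00:09,276 --> 00:00:11,429\nline2\n";

// matchAll yields one match object per header; index 1 is the outer group.
const captured = Array.from(text.matchAll(wholeHeader), (m) => m[1]);
console.log(captured);

// In a replacement string, "$1" refers to that same outer group.
const marked = text.replace(wholeHeader, "[$1]");
console.log(marked);
```

Replacing with an empty string instead of "[$1]" would drop the headers and leave the text lines, along with the now-empty lines where the headers were.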
You can also add or remove boundaries if you want, as in this one:
^(?:[0-9]+\n|\n)(([0-9:,]+)([\s->]+)([0-9:,]+))$
const regex = /^(?:[0-9]+\n|\n)(([0-9:,]+)([\s->]+)([0-9:,]+))$/gm;
const str = `1
00:00:04,019 --> 00:00:07,299
- cdcdc
- cddcd
2
00:00:07,414 --> 00:00:09,155
54564
00:00:09,276 --> 00:00:11,429
- 445454 - ccd
- cdscdcdcd
00:00:11,549 --> 00:00:14,874
line1
line2
`;
let m;
while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
Upvotes: 4